Coalesced memory access in a matrix of coefficients

Hi, I have an equation that boils down to a = T*b + T2*c + T3*d + e. Here, T, T2, and T3 are constants that don’t vary across threads. The coefficients b, c, d, and e vary from one thread to another. The coefficients need (for various reasons) to be stored in an array, X, of size 4*N, where N is the number of data points for which I solve the equation. The order in which the coefficients are stored is flexible.

My question is: which would give me more efficient memory access? Would it be

a = T * X[threadIdx.x] + T2 * X[N + threadIdx.x] + T3 * X[2*N + threadIdx.x] + X[3*N + threadIdx.x];

or

a = T * X[4*threadIdx.x] + T2 * X[4*threadIdx.x + 1] + T3 * X[4*threadIdx.x + 2] + X[4*threadIdx.x + 3];

The reason I am confused is that the second approach gives me contiguous access within a thread, but strided access across threads, and that goes against everything I have learnt about programming GPUs, i.e., memory accesses should be coalesced across the threads of a warp. I basically don’t understand whether the hardware will load one coefficient, e.g., X[threadIdx.x], for every thread and then move on to the next coefficient, or load all the coefficients for one thread followed by the coefficients of the next thread.
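
For context, here is roughly how the two variants look as complete kernels; this is just an illustrative sketch (the float type, the kernel names, keeping T, T2, T3 in constant memory, and using the global thread index i instead of threadIdx.x are my own choices):

__constant__ float T, T2, T3;

// Variant 1: the four coefficients stored plane by plane (structure of arrays).
__global__ void solve_soa(const float* X, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = T * X[i] + T2 * X[N + i] + T3 * X[2*N + i] + X[3*N + i];
}

// Variant 2: the four coefficients of each point stored together (array of structures).
__global__ void solve_aos(const float* X, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = T * X[4*i] + T2 * X[4*i + 1] + T3 * X[4*i + 2] + X[4*i + 3];
}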

First of all, I think the second approach should be
a = T * X[4*threadIdx.x] + T2 * X[4*threadIdx.x + 1] + T3 * X[4*threadIdx.x + 2] + X[4*threadIdx.x + 3];

Your code looks simple enough that you could test out both variants to see which is faster.
Memory coalescing is determined by the number of memory transactions required for the whole warp. This is explained in the CUDA C++ Best Practices Guide.
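
A minimal timing sketch with CUDA events could look like this (solve_soa stands in for either variant; error checking and a warm-up launch are omitted for brevity):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
solve_soa<<<(N + 255) / 256, 256>>>(X, out, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds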

The second access, and also the one cited by @striker159, can be executed with a vector read instruction; the line will probably be compiled into one. So both lines would lead to coalesced accesses (at least in effect; I am not sure it fulfills the strict definition of the word, but the resulting memory transactions are the same).
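
To put rough numbers on it (assuming 4-byte values and a warp of 32 threads): in the first approach, each of the four loads has the warp reading 32 consecutive floats, i.e. 128 contiguous bytes, one fully used transaction per load. In the second approach with four scalar loads, each load instruction has the warp touching addresses 16 bytes apart, spanning 32 * 16 = 512 bytes, i.e. four 128-byte lines per instruction; the first instruction pulls in all four lines and, in the ideal case, the remaining three hit in L1. If the compiler instead emits one 128-bit vector load, the warp reads 32 consecutive float4 values, i.e. 512 contiguous bytes in four fully used transactions, the same traffic as the first approach.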

Thanks for the edit, I have fixed it in the original question

For anyone interested: in my case, the first approach turned out to be faster and had a higher L1 cache hit rate.

Interesting, thank you.
Could you also try it with a float4 (or int4, or whatever fits your data type) for the array?

__device__ float f(const float4* X) {
    float4 b = X[threadIdx.x];  // one 128-bit load fetches all four coefficients
    return T * b.x + T2 * b.y + T3 * b.z + b.w;
}

That makes (more) sure that the memory access is vectorized. In terms of the actual memory accessed, it behaves like the second line.
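
If X already exists as a plain float array in the interleaved layout, it can be reused as float4 with a cast; a sketch (T, T2, T3 as in the question, e.g. as __constant__ floats; the cast requires 16-byte alignment, which the base pointer from cudaMalloc satisfies):

// Reinterpret the interleaved float array as float4. This needs a
// 16-byte aligned pointer; cudaMalloc guarantees that for the base address.
__global__ void solve_vec(const float* X, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float4 b = reinterpret_cast<const float4*>(X)[i];
        out[i] = T * b.x + T2 * b.y + T3 * b.z + b.w;
    }
}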