Coalesced memory access in a matrix of coefficients

Hi, I have an equation that boils down to a = T*b + T2*c + T3*d + e. Here, T, T2, and T3 are constants that don’t vary across threads. The coefficients b, c, d, and e vary from one thread to another. The coefficients need (for various reasons) to be stored in an array, X, of size 4*N, where N is the number of data points for which I solve the equation. The order in which the coefficients are stored is flexible.

My question is: which would give me more efficient memory access? Would it be

a = T * X[threadIdx.x] + T2 * X[N + threadIdx.x] + T3 * X[2*N + threadIdx.x] + X[3*N + threadIdx.x];

or

a = T * X[4*threadIdx.x] + T2 * X[4*threadIdx.x + 1] + T3 * X[4*threadIdx.x + 2] + X[4*threadIdx.x + 3];

The reason I am confused is that the second approach gives me contiguous access within a thread, but strided access across threads, and that goes against everything I have learnt about programming GPUs, i.e., memory accesses should be coalesced across the threads of a warp. I basically don’t understand whether the hardware will load one coefficient, e.g., X[threadIdx.x], for every thread and then move on to the next coefficient, or load all the coefficients for one thread followed by the coefficients of the next thread.
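
For context, here is roughly how the two variants look as complete kernels; this is just an illustrative sketch (the float type, the kernel names, keeping T, T2, T3 in constant memory, and using the global thread index i instead of threadIdx.x are my own choices):

__constant__ float T, T2, T3;

// Variant 1: the four coefficients stored plane by plane (structure of arrays).
__global__ void solve_soa(const float* X, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = T * X[i] + T2 * X[N + i] + T3 * X[2*N + i] + X[3*N + i];
}

// Variant 2: the four coefficients of each point stored together (array of structures).
__global__ void solve_aos(const float* X, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = T * X[4*i] + T2 * X[4*i + 1] + T3 * X[4*i + 2] + X[4*i + 3];
}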

First of all, I think the second approach should be
a = T * X[4*threadIdx.x] + T2 * X[4*threadIdx.x + 1] + T3 * X[4*threadIdx.x + 2] + X[4*threadIdx.x + 3];

Your code looks simple enough that you could test out both variants to see which is faster.
Memory coalescing is determined by the number of memory transactions required for the whole warp. This is explained in the CUDA C++ Best Practices Guide.
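
A minimal timing sketch with CUDA events could look like this (solve_soa stands in for either variant; error checking and a warm-up launch are omitted for brevity):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
solve_soa<<<(N + 255) / 256, 256>>>(X, out, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds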

The second access, and also the one cited by @striker159, can be executed with a vector read instruction; the line will probably be compiled into one. So both lines would lead to coalesced accesses (at least in effect; I am not sure it fulfills the strict definition of the word, but the resulting memory transactions are the same).
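
To put rough numbers on it (assuming 4-byte values and a warp of 32 threads): in the first approach, each of the four loads has the warp reading 32 consecutive floats, i.e. 128 contiguous bytes, one fully used transaction per load. In the second approach with four scalar loads, each load instruction has the warp touching addresses 16 bytes apart, spanning 32 * 16 = 512 bytes, i.e. four 128-byte lines per instruction; the first instruction pulls in all four lines and, in the ideal case, the remaining three hit in L1. If the compiler instead emits one 128-bit vector load, the warp reads 32 consecutive float4 values, i.e. 512 contiguous bytes in four fully used transactions, the same traffic as the first approach.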

Thanks for the edit, I have fixed it in the original question

For anyone interested: in my case, the first approach turned out to be faster and had a higher L1 cache hit rate.

Interesting, thank you.
Could you also try it with a float4 (or int4, or whatever fits your data type) for the array?

__device__ float f(const float4* X) {
    float4 b = X[threadIdx.x];  // one 128-bit load fetches all four coefficients
    return T * b.x + T2 * b.y + T3 * b.z + b.w;
}

That makes (more) sure that the memory access is vectorized. In terms of the actual memory accessed, it behaves like the second line.
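
If X already exists as a plain float array in the interleaved layout, it can be reused as float4 with a cast; a sketch (T, T2, T3 as in the question, e.g. as __constant__ floats; the cast requires 16-byte alignment, which the base pointer from cudaMalloc satisfies):

// Reinterpret the interleaved float array as float4. This needs a
// 16-byte aligned pointer; cudaMalloc guarantees that for the base address.
__global__ void solve_vec(const float* X, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float4 b = reinterpret_cast<const float4*>(X)[i];
        out[i] = T * b.x + T2 * b.y + T3 * b.z + b.w;
    }
}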