N threads read N+1 elements: Coalesced access possible?

My kernel has N threads and it must read an array of N+1 floats on each invocation. The array of N+1 floats is allocated in global memory with MemAllocPitch to meet the alignment requirements. The problem is that I am not sure how to get coalesced access with this pattern. So far my code does the following:

Thread i loads element i from the array into shared memory (i from 0 to N-1).
If i == N-1, it also loads element N.
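In kernel form, this is roughly what I mean (a minimal sketch; the names `in` and `s_data` are placeholders, and the block size is assumed to be N):

```cuda
// Sketch of the current load pattern: thread i loads element i,
// and the last thread additionally loads element N.
// Launch with a 1D block of N threads and (N+1)*sizeof(float)
// bytes of dynamic shared memory.
__global__ void loadKernel(const float *in)
{
    extern __shared__ float s_data[];   // holds N+1 floats

    unsigned int i = threadIdx.x;
    s_data[i] = in[i];                  // thread i -> element i
    if (i == blockDim.x - 1)
        s_data[i + 1] = in[i + 1];      // last thread does the +1 read
    __syncthreads();

    // ... use s_data ...
}
```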

Is the last thread breaking the coalesced memory access? If so, would a texture-based access pattern improve performance?

If I were you, I would let thread 0 read the extra value. Letting the last thread read it is sure to break the coalescing rules.

But why? To me, having thread 0 read elements 0 and 3 (let N = 3 in the next example) breaks coalesced access for sure, right?

t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t2 -> array[3]

t0 -> array[0]
t0 -> array[3]
t1 -> array[1]
t2 -> array[2]

But in reality it runs differently. Let’s assume 32 threads:

  • threads 0–31 read values 0–31
  • thread 0 reads value 32
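The suggested variant, as a hedged kernel sketch (again, `in` and `s_data` are placeholder names; block size assumed to be N):

```cuda
// Variant where thread 0, not the last thread, performs the extra
// read, so the main 0..N-1 access pattern is left untouched.
__global__ void loadKernel(const float *in)
{
    extern __shared__ float s_data[];   // holds N+1 floats

    unsigned int i = threadIdx.x;
    s_data[i] = in[i];                  // threads 0..N-1 read elements 0..N-1
    if (i == 0)
        s_data[blockDim.x] = in[blockDim.x];  // thread 0 reads element N
    __syncthreads();
}
```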

I was actually thinking of memory-bank conflicts in shared memory. There you want each thread of a warp to access a different bank. I think that in your case, letting the first or last thread do the read probably makes no difference. It is just that there might be an advantage to using threadIdx.x == 0; if I remember correctly, someone from NVIDIA once explained something about divergent warps.

Mmmm. That might be interesting but I have not found the reference. What kind of advantage is that?

Don’t worry about bank conflicts when reading from global to shared memory. Bank conflicts really only come into play when you read/write shared memory often in an inner loop without global reads. The latency of the global memory read is so high that the cost of a puny little bank conflict is of no consequence; in practice it is hidden along with the latency.

Also, I wouldn’t worry about divergent warps when doing the if (threadIdx.x == 0). Again, this is a slow global memory read, so the “slow” warp divergence is hidden the same way as the global memory latency.

Finally, as DenisR said, you MUST have threadIdx.x == 0 (well, threadIdx.x equal to a multiple of 32) read the N+1-th value. Read the half-warp base-address coalescing rule in the programming guide for details.

I think that you mean (threadIdx.x+1) a multiple of 32, right?

In any case, I have tried both schemes (repeated below for convenience) and I am not getting any difference in performance at all. My block size in the x dimension is 16, so that might explain the similar performance.

t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t2 -> array[3]

t0 -> array[0]
t0 -> array[3]
t1 -> array[1]
t2 -> array[2]

Any suggestion?

Well, if the performance is the same, I would use the threadIdx.x == 0 version (because that is what all the examples look like).

And it is (just to stress the ordering; it is sometimes important to realize this):
t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t2 -> array[3]

t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t0 -> array[3]

I would have expected a noticeable difference. I’ll write a little benchmark tomorrow to see for myself.

OK, my own tests confirm it: there is no noticeable difference. I get ~58 GiB/s effective bandwidth no matter which thread performs the one extra read. Perhaps this is because the coalesced read still fetches the next 15 elements, and that bandwidth is lost because those values are not used.

I was able to reach ~63 GiB/s by reading the first N values with a coalesced read and reading the +1 value using tex1Dfetch. That’s a slight improvement. I’m not sure how to improve things further, unless your array is small enough to be read from constant memory.
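A sketch of that hybrid scheme (the texture reference name and its binding are my assumptions; tex1Dfetch is the 1D fetch from the legacy texture-reference API, which requires binding the same device array with cudaBindTexture before the launch):

```cuda
// First N values via a coalesced global read, the +1 value via the
// texture cache. `arrayTex` must be bound to the same device array
// with cudaBindTexture before launching the kernel.
texture<float, 1, cudaReadModeElementType> arrayTex;

__global__ void loadKernelTex(const float *in)
{
    extern __shared__ float s_data[];   // holds N+1 floats

    unsigned int i = threadIdx.x;
    s_data[i] = in[i];                  // coalesced read of elements 0..N-1
    if (i == 0)
        s_data[blockDim.x] = tex1Dfetch(arrayTex, blockDim.x);  // cached +1 read
    __syncthreads();
}
```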

No, I definitely mean threadIdx.x equal to a multiple of 32. If thread 1 in a warp reads a base address that is a multiple of 256 bits, the read is not coalesced. You need thread 0 in the warp to perform this read for it to be coalesced (though it doesn’t seem to make a difference for some reason, as I said above). I confirmed this with the profiler counters, by the way.

Thank you for checking. I am actually accessing a 2D matrix stored by rows, of size (N+1, N+1), with a block size of (N, N). Just in case someone needs it, what I do is:

  • Each thread with threadIdx.x = N-1 reads matrix elements (N, threadIdx.y) and (threadIdx.x, N+1).
  • Each thread with threadIdx.y = N-1 reads matrix elements (threadIdx.x, N) and (threadIdx.x, N+1).
  • I don’t read element (N+1, N+1) because I don’t need it.

I cannot think of a more efficient method (except maybe for the elements with threadIdx.y = N-1, which represent the last line of the matrix).
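For what it’s worth, here is how I would sketch that 2D halo load (the exact indexing is my interpretation of the bullets above; the `pitch` parameter is the row pitch in bytes returned by MemAllocPitch, and the corner element is skipped as described):

```cuda
// Load an (N+1) x (N+1) tile into shared memory using an N x N block.
// Interior: thread (x, y) loads element (x, y).
// Halo: the last column of threads also loads column N, and the last
// row of threads also loads row N. The corner element is not loaded.
#define N 16  // assumed block size in each dimension

__global__ void loadTile(const float *matrix, size_t pitch)
{
    __shared__ float tile[N + 1][N + 1];

    unsigned int x = threadIdx.x, y = threadIdx.y;
    const float *row = (const float *)((const char *)matrix + y * pitch);

    tile[y][x] = row[x];                        // interior element (x, y)
    if (x == N - 1)
        tile[y][N] = row[N];                    // halo column N
    if (y == N - 1) {
        const float *lastRow =
            (const float *)((const char *)matrix + N * pitch);
        tile[N][x] = lastRow[x];                // halo row N
    }
    __syncthreads();
}
```

The byte-based pointer arithmetic is the usual way to step through a pitched allocation, since the pitch is generally larger than the row width.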