My kernel has N threads and it must read an array of N+1 floats on each invocation. The array of N+1 floats is allocated in global memory with MemAllocPitch to meet the alignment requirements. The problem is that I am not sure how to get coalesced access with this pattern. So far my code does the following:
- Thread i loads element i from the array into shared memory (i from 0 to N-1).
- If i == N-1, that thread also loads element N.
Is the last thread breaking the coalesced memory access? If so, would a texture-based access pattern improve performance?
But in reality it runs differently. Let’s assume 32 threads:
- threads [0, 31] read values [0, 31]
- thread 0 reads value 32
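For reference, the two load patterns being compared could be sketched like this (names such as `g_data` and `s_data` are made up, and N is assumed to equal blockDim.x):

```cuda
// Scheme 1: each thread loads its own element; the LAST thread also
// loads the extra N-th element. Names are illustrative.
__global__ void load_extra_last(const float *g_data)
{
    extern __shared__ float s_data[];   // N+1 floats
    int i = threadIdx.x;
    int N = blockDim.x;

    s_data[i] = g_data[i];              // coalesced: thread i -> element i
    if (i == N - 1)
        s_data[N] = g_data[N];          // last thread reads the +1 value
    __syncthreads();
    // ... use s_data ...
}

// Scheme 2: same bulk load, but thread 0 reads the extra element.
__global__ void load_extra_first(const float *g_data)
{
    extern __shared__ float s_data[];
    int i = threadIdx.x;
    int N = blockDim.x;

    s_data[i] = g_data[i];
    if (i == 0)
        s_data[N] = g_data[N];          // first thread reads the +1 value
    __syncthreads();
    // ... use s_data ...
}
```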
I was actually thinking of memory-bank conflicts in shared memory. There you want each thread of a warp to access a different bank. I think that in your case letting the first or last thread do the read probably makes no difference. It is just that there might be an advantage to using threadIdx.x == 0; if I remember correctly, someone from NVidia once explained something about divergent warps.
Don’t worry about bank conflicts when reading from global->shared. Bank conflicts really only come into play when you read/write shared mem often in an inner loop w/o global reads. The latency of the global memory read is so high that the cost of a puny little bank conflict is of no consequence and in practice is hidden along with the latency.
Also, I wouldn’t worry about divergent warps when doing the if (threadIdx.x == 0). Again, this is a slow global memory read so the “slow” warp divergence is hidden the same as the global mem latency.
Finally, as DenisR said, you MUST have threadIdx.x == 0 (well, threadIdx.x a multiple of 32) read the n+1 value. Read the half-warp base-address coalescing rule in the programming guide for details.
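In code, that suggestion amounts to something like the following sketch (warp size of 32 assumed, names illustrative):

```cuda
// Thread 0 is the first thread of its warp (threadIdx.x is a multiple
// of 32), so its lone extra read starts at the warp's base address,
// which is what the half-warp coalescing rule requires.
__global__ void load_warp_aligned(const float *g_data)
{
    extern __shared__ float s_data[];
    int i = threadIdx.x;
    int N = blockDim.x;

    s_data[i] = g_data[i];      // coalesced bulk load of the first N values
    if (i == 0)                 // first thread of the first warp
        s_data[N] = g_data[N];  // extra read issued from a warp base
    __syncthreads();
    // ... use s_data ...
}
```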
I think that you mean (threadIdx.x+1) a multiple of 32, right?
In any case, I have tried both schemes (I repeat them below for convenience) and I am not getting any difference in performance at all. My block size in the x dimension is 16, so that might influence the similarity in performance.
OK, my own tests confirm: there is no noticeable difference. I get ~58 GiB/s effective bandwidth no matter which thread performs the one extra read. Perhaps this is because the coalesced read still fetches the next +15 elements, and that bandwidth is lost because those values are not used.
I was able to reach ~63 GiB/s by reading the first n values with a coalesced read and reading the +1 value using tex1Dfetch. That’s a slight improvement. I’m not sure how to improve things further, unless your array is small enough to be read from constant memory.
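The tex1Dfetch variant could look like this (a sketch using the old texture-reference API; the texture and array names are made up):

```cuda
// 1D texture reference for the single out-of-pattern read; the texture
// cache absorbs the misaligned access. Names are illustrative.
texture<float, 1, cudaReadModeElementType> tex_data;

__global__ void load_with_tex(const float *g_data, int N)
{
    extern __shared__ float s_data[];
    int i = threadIdx.x;

    s_data[i] = g_data[i];                   // coalesced read of the first N values
    if (i == 0)
        s_data[N] = tex1Dfetch(tex_data, N); // +1 value through the texture cache
    __syncthreads();
    // ... use s_data ...
}

// Host side (sketch): bind the device array to the texture before launching.
// cudaBindTexture(0, tex_data, d_data, (N + 1) * sizeof(float));
```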
No, I definitely mean threadIdx.x == a multiple of 32. If thread 1 in a warp reads from a base address that is a multiple of 256 bits, the read is not coalesced. You need thread 0 in the warp to perform this read for it to be coalesced (though, as I said above, it doesn’t seem to make a difference for some reason). I confirmed this with the profiler counters, btw.
Thank you for checking. I am actually accessing a 2D matrix stored by rows, of size (N+1, N+1), with block size (N, N). Just in case someone needs it, here is what I do:
- Each thread with threadIdx.x = N-1 reads matrix elements (N, threadIdx.y) and (threadIdx.x, N+1).
- Each thread with threadIdx.y = N-1 reads matrix elements (threadIdx.x, N) and (threadIdx.x, N+1).
- I don’t read element (N+1, N+1) because I don’t need it.
I cannot think of any more efficient method (except maybe for the elements with threadIdx.y = N-1, which form the last row of the matrix).
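A common way to write this kind of halo load, sketched under the assumption that the block is square and TILE == blockDim.x == blockDim.y (the exact index mapping above may differ from this sketch, and `pitch` is the row pitch in floats from the pitched allocation):

```cuda
// Load an (TILE+1) x (TILE+1) tile with a TILE x TILE block: border
// threads also fetch the extra row/column. Names are illustrative.
#define TILE 16

__global__ void load_tile(const float *g, size_t pitch)
{
    __shared__ float s[TILE + 1][TILE + 1];
    int x = threadIdx.x, y = threadIdx.y;

    s[y][x] = g[y * pitch + x];              // bulk of the tile, coalesced per row
    if (x == TILE - 1)
        s[y][TILE] = g[y * pitch + TILE];    // extra column
    if (y == TILE - 1)
        s[TILE][x] = g[TILE * pitch + x];    // extra row (coalesced)
    // the corner element (TILE, TILE) is not loaded, as it is not needed
    __syncthreads();
    // ... use s ...
}
```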