N threads read N+1 elements: Coalesced access possible?

My kernel has N threads and it must read an array of N+1 floats on each invocation. The array of N+1 floats is allocated in global memory with MemAllocPitch to meet the alignment requirements. The problem is that I am not sure how to get coalesced access with this pattern. So far my code does the following:

Thread i loads element i from the array into shared memory (i from 0 to N-1).
If i == N-1, it also loads element N.
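In kernel form, this is roughly what I mean (a minimal sketch; the names `in` and `s_data` are placeholders, and the block size is assumed to be N):

```cuda
// Sketch of the current load pattern: thread i loads element i,
// and the last thread additionally loads element N.
// Launch with a 1D block of N threads and (N+1)*sizeof(float)
// bytes of dynamic shared memory.
__global__ void loadKernel(const float *in)
{
    extern __shared__ float s_data[];   // holds N+1 floats

    unsigned int i = threadIdx.x;
    s_data[i] = in[i];                  // thread i -> element i
    if (i == blockDim.x - 1)
        s_data[i + 1] = in[i + 1];      // last thread does the +1 read
    __syncthreads();

    // ... use s_data ...
}
```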

Is the last thread breaking the coalesced memory access? If so, would a texture-based access pattern improve performance?

If I were you, I would let thread 0 read the extra value. Letting the last thread read it is sure to break the coalescing rules.

But why? To me, having thread 0 read elements 0 and 3 (let N = 3 in the next example) breaks coalesced access for sure, right?

t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t2 -> array[3]

t0 -> array[0]
t0 -> array[3]
t1 -> array[1]
t2 -> array[2]

But in reality it runs differently. Let’s assume 32 threads:

  • threads 0–31 read values 0–31
  • thread 0 reads value 32
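The suggested variant, as a hedged kernel sketch (again, `in` and `s_data` are placeholder names; block size assumed to be N):

```cuda
// Variant where thread 0, not the last thread, performs the extra
// read, so the main 0..N-1 access pattern is left untouched.
__global__ void loadKernel(const float *in)
{
    extern __shared__ float s_data[];   // holds N+1 floats

    unsigned int i = threadIdx.x;
    s_data[i] = in[i];                  // threads 0..N-1 read elements 0..N-1
    if (i == 0)
        s_data[blockDim.x] = in[blockDim.x];  // thread 0 reads element N
    __syncthreads();
}
```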

I was actually thinking of memory-bank conflicts in shared memory. There you want each thread of a warp to access a different bank. I think that in your case, letting the first or last thread do the read probably makes no difference. It is just that there might be an advantage to using threadIdx.x == 0; if I remember correctly, someone from NVIDIA once explained something about divergent warps.

Mmmm. That might be interesting but I have not found the reference. What kind of advantage is that?

Don’t worry about bank conflicts when reading from global to shared memory. Bank conflicts really only come into play when you read/write shared memory often in an inner loop without global reads. The latency of the global memory read is so high that the cost of a puny little bank conflict is of no consequence; in practice it is hidden along with the latency.

Also, I wouldn’t worry about divergent warps when doing the if (threadIdx.x == 0). Again, this is a slow global memory read, so the “slow” warp divergence is hidden the same way as the global memory latency.

Finally, as DenisR said, you MUST have threadIdx.x == 0 (well, threadIdx.x equal to a multiple of 32) read the N+1-th value. Read the half-warp base-address coalescing rule in the programming guide for details.

I think that you mean (threadIdx.x+1) a multiple of 32, right?

In any case, I have tried both schemes (repeated below for convenience) and I am not getting any difference in performance at all. My block size in the x dimension is 16, so that might explain the similar performance.

t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t2 -> array[3]

t0 -> array[0]
t0 -> array[3]
t1 -> array[1]
t2 -> array[2]

Any suggestion?

Well, if the performance is the same, I would use the threadIdx.x == 0 version (because that is what all the examples look like).

And it is (just to stress the ordering; it is sometimes important to realize this):
t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t2 -> array[3]

t0 -> array[0]
t1 -> array[1]
t2 -> array[2]
t0 -> array[3]

I would have expected a noticeable difference. I’ll write a little benchmark tomorrow to see for myself.

OK, my own tests confirm it: there is no noticeable difference. I get ~58 GiB/s effective bandwidth no matter which thread performs the one extra read. Perhaps this is because the coalesced read still fetches the next 15 elements, and that bandwidth is lost because those values are not used.

I was able to reach ~63 GiB/s by reading the first N values with a coalesced read and reading the +1 value using tex1Dfetch. That’s a slight improvement. I’m not sure how to improve things further, unless your array is small enough to be read from constant memory.
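A sketch of that hybrid scheme (the texture reference name and its binding are my assumptions; tex1Dfetch is the 1D fetch from the legacy texture-reference API, which requires binding the same device array with cudaBindTexture before the launch):

```cuda
// First N values via a coalesced global read, the +1 value via the
// texture cache. `arrayTex` must be bound to the same device array
// with cudaBindTexture before launching the kernel.
texture<float, 1, cudaReadModeElementType> arrayTex;

__global__ void loadKernelTex(const float *in)
{
    extern __shared__ float s_data[];   // holds N+1 floats

    unsigned int i = threadIdx.x;
    s_data[i] = in[i];                  // coalesced read of elements 0..N-1
    if (i == 0)
        s_data[blockDim.x] = tex1Dfetch(arrayTex, blockDim.x);  // cached +1 read
    __syncthreads();
}
```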

No, I definitely mean threadIdx.x equal to a multiple of 32. If thread 1 in a warp reads a base address that is a multiple of 256 bits, the read is not coalesced. You need thread 0 in the warp to perform this read for it to be coalesced (though it doesn’t seem to make a difference for some reason, as I said above). I confirmed this with the profiler counters, by the way.

Thank you for checking. I am actually accessing a 2D matrix stored by rows, of size (N+1, N+1), with a block size of (N, N). Just in case someone needs it, what I do is:

  • Each thread with threadIdx.x = N-1 reads matrix elements (N, threadIdx.y) and (threadIdx.x, N+1).
  • Each thread with threadIdx.y = N-1 reads matrix elements (threadIdx.x, N) and (threadIdx.x, N+1).
  • I don’t read element (N+1, N+1) because I don’t need it.

I cannot think of a more efficient method (except maybe for the elements with threadIdx.y = N-1, which represent the last line of the matrix).
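For what it’s worth, here is how I would sketch that 2D halo load (the exact indexing is my interpretation of the bullets above; the `pitch` parameter is the row pitch in bytes returned by MemAllocPitch, and the corner element is skipped as described):

```cuda
// Load an (N+1) x (N+1) tile into shared memory using an N x N block.
// Interior: thread (x, y) loads element (x, y).
// Halo: the last column of threads also loads column N, and the last
// row of threads also loads row N. The corner element is not loaded.
#define N 16  // assumed block size in each dimension

__global__ void loadTile(const float *matrix, size_t pitch)
{
    __shared__ float tile[N + 1][N + 1];

    unsigned int x = threadIdx.x, y = threadIdx.y;
    const float *row = (const float *)((const char *)matrix + y * pitch);

    tile[y][x] = row[x];                        // interior element (x, y)
    if (x == N - 1)
        tile[y][N] = row[N];                    // halo column N
    if (y == N - 1) {
        const float *lastRow =
            (const float *)((const char *)matrix + N * pitch);
        tile[N][x] = lastRow[x];                // halo row N
    }
    __syncthreads();
}
```

The byte-based pointer arithmetic is the usual way to step through a pitched allocation, since the pitch is generally larger than the row width.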