There must be a reason, but I cannot figure it out :(

I have two pieces of code with hugely different performance, and I cannot work out why :(

d_data1 and d_data2 are device arrays:

__global__ void add( int* d_data1, int* d_data2 )  // the first sample
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int off = (bid<<9)+tid;  // 512 threads per block
    int k = d_data1[off];
    if( k != 0 )             // when the element in d_data1 is not 0, add it to d_data2
        d_data2[off] += k;
}

__global__ void add( int* d_data1, int* d_data2 )  // the second sample
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int off = (bid<<9)+tid;  // 512 threads per block
    int k = d_data1[off];
    if( k != 0 )             // when the element in d_data1 is not 0, add it to d_data2
        d_data2[off-k] += k; // the only difference: d_data2[off-k] instead of d_data2[off]
}

For the second one, we already know that when k != 0, off-k increases sequentially, which means we access d_data2[0], d_data2[1], d_data2[2], … with no conflicts. Yet the second one takes about 5 times longer than the first one :(

Can someone tell me why?

Thanks in advance.

If k is not the same for all threads in the warp, then in general your reads in the second case will not be coalesced. Can you say more about the contents of data1?

But can the read be coalesced in the first case? The elements in data1 look like this:

index:  0 1 2 3 4 5 6 7 8 9 10 …
number: 0 0 2 0 3 0 0 5 0 6 0  …

That is: some elements of data1 are 0; the rest are non-zero numbers in increasing order. And there is another rule: the index minus the non-zero number at that index gives 0, 1, 2, 3, 4, … (e.g. 2-2=0, 4-3=1, 7-5=2, 9-6=3).

Is there another way to read it coalesced?

In the second case, the loads and stores go to consecutive memory locations. But the problem is that they may NOT come from consecutive threads, which breaks the coalescing rule. So I would guess coalescing does NOT happen in the second case.

In the first case, the presence of zeroes means that non-consecutive locations are loaded and stored. However, the accessed addresses satisfy the “HalfWarpBaseAddress + TID” rule required for coalescing. Check out Figure 5-1 (page 66 of 143) of the CUDA 1.1 manual.

The CUDA manual says:

So, I wonder why you get good performance with the second one.

Are your d_data1 and d_data2 aligned at the correct boundaries? Generally, if you use cudaMalloc() it makes sure the addresses are 256-byte aligned.


Aaaah… I don’t understand from your post which one works faster. I would expect the first one to be faster, since it has memory coalescing.

The second one does NOT satisfy the coalescing rule. So it is bound to be very slow.

Let me paste the full coalescing rule from the manual:

Yes, Sarnath is right. The second example breaks coalescing when reading and writing d_data2.

Yes, the second one is much slower…

Can someone give me more details about how the first and second kernels execute? For example, how does a warp (or half-warp) execute the same piece of code? For the second one, I would think a warp could do it all at once, since there is no conflict between different threads’ reads and writes. Why not?

Thanks, your replies helped me understand it much better :)

Consider your data:

index:  0 1 2 3 4 5 6 7 8 9 10 …
number: 0 0 2 0 3 0 0 5 0 6 0  …

Consider just a single WARP.

In the first case:

Thread 0 – no access to d_data2
Thread 1 – no access to d_data2
Thread 2 – accesses “(int *)d_data2 + 2”
Thread 3 – no access
Thread 4 – accesses “(int *)d_data2 + 4”
Thread 5 – no access
Thread 6 – no access
Thread 7 – accesses “(int *)d_data2 + 7”
Thread 8 – no access
Thread 9 – accesses “(int *)d_data2 + 9”

and so on…

(Every thread reads d_data1[off]; the accesses in question here are the ones to d_data2.)

Now, in practice all the threads in the WARP execute simultaneously and all the memory requests are issued simultaneously. The best case is when all 32 threads access 32 consecutive data items starting from an aligned address – in your case, 32 integers.

Thus if you consider the base address “X”, then thread ID “Y” will access the integer pointer “(int *)X + Y”. This is the requirement for coalescing to happen. And even if some of the 32 threads do NOT access memory at all, coalescing still happens.

In the first case – if you observe, thread ID Y accesses “d_data2 + Y” – which favours coalescing.

If you work out the same for the second case, you will find that it does NOT hold, which causes coalescing to fail.


What is a half-warp? – A multiprocessor has 8 processors. Each of them is fed by an instruction unit that runs at 1/4 the speed of the processors, so a processor executes the same instruction 4 times before the next instruction starts. Thus one instruction is executed a total of 32 times, each time for a unique thread inside a WARP. That is why the WARP size is 32.

Now, when a memory-access instruction is executed by a WARP, it is executed in 4 cycles of 8 threads each. The first two cycles and the last two cycles are called “half-warps” – the first half and the second half. Currently, global memory coalescing happens at the half-warp level. In future this could move to the full warp; i.e. even when all 32 threads access memory, coalescing first happens for the first 16 threads, followed by the next 16.

Hope this helped you in some way

thanks, it is very helpful :)