This problem has been bothering me for a long time…
I have a float2 array allocated with cudaMalloc, and I write values to it from a kernel launched with 4 blocks of 256 threads each. I tried two ways of indexing the writes:
[codebox]// First snippet
int j = blockDim.x * blockIdx.x + threadIdx.x;
data[j] = a0;
data[j+4*256] = a1;

// Second snippet (used instead of the first)
int j = ((blockIdx.x / 2)*4 + blockIdx.x % 2)*256 + threadIdx.x;
data[j] = a0;
data[j+2*256] = a1;[/codebox]
I expected the two snippets to take the same time, but the first one runs about 10 times faster than the second. Does anyone know why this could happen? As far as I can tell, the second snippet is also coalesced…
BTW, I am using a 9600GT with CUDA 2.3. Thanks in advance.