Why do these two coalesced memory writes vary so much?

This problem has been bothering me for a long time…

I have a float2 array allocated with cudaMalloc, and I tried writing some values to it two different ways. The grid size is 4 blocks, with 256 threads per block:

[codebox]// version 1: linear block indexing, second write 4*256 elements away
int j = blockDim.x * blockIdx.x + threadIdx.x;
data[j] = a0;
data[j + 4*256] = a1;

// version 2: interleaved block indexing, second write 2*256 elements away
j = ((blockIdx.x / 2)*4 + blockIdx.x % 2)*256 + threadIdx.x;
data[j] = a0;
data[j + 2*256] = a1;[/codebox]
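
To make this reproducible, here is a minimal self-contained sketch of what I am doing; the kernel names, the actual a0/a1 values, and the exact allocation size are placeholders, not my original code:

[codebox]// Minimal self-contained sketch (kernel names, a0/a1 values, and the
// allocation size are placeholder assumptions, not the original code).
__global__ void write_v1(float2 *data, float2 a0, float2 a1)
{
    // version 1: linear block indexing
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    data[j] = a0;            // covers elements [0, 4*256)
    data[j + 4*256] = a1;    // covers elements [4*256, 8*256)
}

__global__ void write_v2(float2 *data, float2 a0, float2 a1)
{
    // version 2: interleaved block indexing, second write 2*256 away
    int j = ((blockIdx.x / 2)*4 + blockIdx.x % 2)*256 + threadIdx.x;
    data[j] = a0;
    data[j + 2*256] = a1;
}

int main(void)
{
    float2 *data;
    cudaMalloc((void**)&data, 8 * 256 * sizeof(float2));  // room for both writes
    float2 a0 = make_float2(1.0f, 2.0f);  // placeholder values
    float2 a1 = make_float2(3.0f, 4.0f);
    write_v1<<<4, 256>>>(data, a0, a1);
    write_v2<<<4, 256>>>(data, a0, a1);
    cudaThreadSynchronize();  // CUDA 2.3-era synchronization call
    cudaFree(data);
    return 0;
}[/codebox]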

I thought the execution time of these two versions should be the same, but it turns out the first code segment runs 10 times faster than the second one. Does anyone know why this happens? As far as I can tell, the second snippet is also coalesced…

BTW, I am using a 9600 GT with CUDA 2.3. Thanks in advance.

At first sight it seems that only partition camping could arise here.
What does your profiler say?
And what happens if you change data[j+4*256] = a1; to data[j+2*256] = a1; in the first kernel?
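
That is, a hypothetical test variant like this:

[codebox]// Hypothetical test: keep version 1's linear indexing, but halve the
// second write's offset to 2*256 (the writes then partially overlap
// with neighboring blocks, which is harmless for a pure timing test).
// If this variant also slows down, the address pattern of the second
// store is the likely culprit rather than the index arithmetic.
int j = blockDim.x * blockIdx.x + threadIdx.x;
data[j] = a0;
data[j + 2*256] = a1;  // was 4*256[/codebox]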

I saw in the profiler that the second code snippet has a much larger branch count than the first one, which seems to be what causes the difference. Can anyone tell me how this could happen?
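
One thing that might be worth ruling out: the second index computation uses integer division and modulo, which are emulated with multi-instruction sequences on this hardware, and that emulation could plausibly show up in the branch counter. The compiler may well already turn / 2 and % 2 on the unsigned blockIdx.x into a shift and a mask, but a hypothetical rewrite with explicit bit operations is a quick way to check:

[codebox]// Hypothetical rewrite of the second index computation using shift and
// mask instead of / and % (equivalent for the unsigned blockIdx.x).
// If the branch count drops, the division/modulo emulation was the source.
int j = (((blockIdx.x >> 1) << 2) + (blockIdx.x & 1)) * 256 + threadIdx.x;
data[j] = a0;
data[j + 2*256] = a1;[/codebox]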