There must be a reason, but I cannot figure it out :(

I have two pieces of code whose performance differs hugely, but I cannot figure out the reason :(

d_data1 and d_data2 are device arrays:

__global__ void add( int* d_data1, int* d_data2 ) // the first sample
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int off = (bid<<9)+tid; // 512 threads per block
    int k = d_data1[off];
    if( k != 0 ) // when the element in d_data1 is not 0, add it to d_data2
        d_data2[off] += k;
}

__global__ void add( int* d_data1, int* d_data2 ) // the second sample
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int off = (bid<<9)+tid; // 512 threads per block
    int k = d_data1[off];
    if( k != 0 ) // when the element in d_data1 is not 0, add it to d_data2
        d_data2[off-k] += k; // the only difference: instead of d_data2[off]
                             // we use d_data2[off-k]
}

For the second one, we already know that when k != 0, off-k increases sequentially across threads, which means we access d_data2[0], d_data2[1], d_data2[2], d_data2[3], d_data2[4], … with no two threads touching the same element.
Even so, the second one takes 5 times longer than the first one :(
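
One thing that might be worth checking is how the two access patterns line up with the threads that issue them. Below is a rough host-side C++ sketch (no GPU needed) that models this. Everything in it is an assumption for illustration: the sample contents of d_data1 (nonzero k at odd offsets only), the helper names make_sample, target_index, targets_distinct and count_out_of_order, and the idea that the device uses strict per-half-warp coalescing, where thread i of a 16-thread half-warp is expected to access word i of an aligned segment. The post does not say which hardware is used, so this is a possible lead and not a confirmed explanation.

```cpp
#include <set>
#include <vector>

constexpr int kThreadsPerBlock = 512; // from the post: off = (bid << 9) + tid
constexpr int kHalfWarp = 16;         // assumption: strict per-half-warp coalescing

// Hypothetical sample for d_data1: odd offsets hold a nonzero k chosen so that
// off - k counts 0, 1, 2, ... as described in the post; even offsets hold 0.
std::vector<int> make_sample(int n) {
    std::vector<int> d1(n, 0);
    int rank = 0;
    for (int off = 1; off < n; off += 2) {
        d1[off] = off - rank; // guarantees off - k == rank
        ++rank;
    }
    return d1;
}

// Index of d_data2 touched by the thread at `off`, or -1 when k == 0 (no access).
// shifted == false models the first sample, shifted == true the second.
int target_index(const std::vector<int>& d1, int off, bool shifted) {
    int k = d1[off];
    if (k == 0) return -1;
    return shifted ? off - k : off;
}

// True if no two threads touch the same d_data2 element.
bool targets_distinct(const std::vector<int>& d1, bool shifted) {
    std::set<int> seen;
    for (int off = 0; off < static_cast<int>(d1.size()); ++off) {
        int t = target_index(d1, off, shifted);
        if (t >= 0 && !seen.insert(t).second) return false;
    }
    return true;
}

// Counts accesses whose word slot (target % 16) differs from the issuing
// thread's lane within its half-warp (tid % 16). Simplified model only.
int count_out_of_order(const std::vector<int>& d1, bool shifted) {
    int count = 0;
    for (int off = 0; off < static_cast<int>(d1.size()); ++off) {
        int t = target_index(d1, off, shifted);
        if (t < 0) continue;
        int lane = (off % kThreadsPerBlock) % kHalfWarp;
        if (t % kHalfWarp != lane) ++count;
    }
    return count;
}
```

On this sample, targets_distinct returns true for both variants, which matches the "no conflict" observation. However, count_out_of_order is zero for the first sample and nonzero for the second: the addresses are sequential, but the threads that issue them are not lined up with them. Whether that accounts for the 5x difference depends on the actual device and data, which this sketch cannot tell.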

Can someone tell me why?

Thanks in advance.

Please don’t post the same question to different threads.

Sorry about that...