I have two kernels whose performance differs hugely, but I cannot figure out the reason.
d_data1 and d_data2 are device arrays:
__global__ void add(int* d_data1, int* d_data2) // the first sample
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int off = (bid << 9) + tid; // 512 threads per block
    int k = d_data1[off];
    if (k != 0) // when the element in d_data1 is not 0, add it to d_data2
        d_data2[off] += k;
}
__global__ void add(int* d_data1, int* d_data2) // the second sample
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int off = (bid << 9) + tid; // 512 threads per block
    int k = d_data1[off];
    if (k != 0) // when the element in d_data1 is not 0, add it to d_data2
        d_data2[off - k] += k; // the only difference: instead of d_data2[off]
                               // we use d_data2[off - k]
}
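For the second one, we already know that for the threads with k != 0, off - k increases sequentially. That means those threads access d_data2[0], d_data2[1], d_data2[2], d_data2[3], d_data2[4], … in order, with no two threads touching the same element, so there are no conflicts.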
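To make that access pattern concrete, here is a small host-side sketch in plain C++ (no GPU needed). It only replays the index arithmetic of the second kernel. The helper names write_targets and conflict_free, and the sample input in the usage note, are made up for illustration and are not my real data.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstddef>
#include <set>
#include <vector>

// Replays the indexing of the second kernel on the host: for every
// position off with k != 0, record the d_data2 index (off - k) that
// the corresponding thread would update, in thread order.
std::vector<int> write_targets(const std::vector<int>& d_data1) {
    std::vector<int> targets;
    for (std::size_t off = 0; off < d_data1.size(); ++off) {
        int k = d_data1[off];
        if (k != 0) {
            targets.push_back(static_cast<int>(off) - k);
        }
    }
    return targets;
}

// Returns true when every target is a valid (non-negative) index and
// no two threads would update the same element of d_data2.
bool conflict_free(const std::vector<int>& targets) {
    std::set<int> seen;
    for (int t : targets) {
        if (t < 0 || !seen.insert(t).second) {
            return false;
        }
    }
    return true;
}
```

For a hypothetical input such as {0, 0, 2, 0, 3, 3, 0, 4}, write_targets returns {0, 1, 2, 3} and conflict_free returns true, which is the kind of pattern I mean. Note that the threads doing those updates sit at positions 2, 4, 5 and 7, so the d_data2 index no longer matches the thread index.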
Yet the second one takes roughly 5 times as long as the first one :(
Can someone tell me why?
Thanks in advance.