Hi everyone,
First, please look at my code:[codebox]__global__ void item_Count(int *dk_buffer, …)
{
    // Note: all extern __shared__ arrays start at the same address, so
    // tran_buffer and item_cnt refer to the same dynamic shared memory.
    extern __shared__ int tran_buffer[];
    extern __shared__ int item_cnt[];   // item_cnt[3*blockDim.x]
    int* item_id = &item_cnt[blockDim.x];
    int* item_cnt_rec = &item_id[blockDim.x];
    unsigned int item_idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int count = 0, address = 0, tran_begin = 0, tran_sz;
    unsigned long i;
    unsigned int j, index, index_end;

    for (i = 0; i < 232; i++) {
        // each thread loads two elements into shared memory
        tran_buffer[threadIdx.x] = dk_buffer[address + threadIdx.x];
        tran_buffer[blockDim.x + threadIdx.x] = dk_buffer[address + blockDim.x + threadIdx.x];
        __syncthreads();
        // count how often this thread's item index occurs in the buffer
        for (j = 0; j < 512; j++)
            if (tran_buffer[j] == item_idx) count++;
        address += blockDim.x << 1;
    }
    // takes 0.427968 ms when the code below is commented out

    item_cnt[threadIdx.x] = 0;          // count;
    item_id[threadIdx.x] = item_idx;
    item_cnt_rec[threadIdx.x] = 0;
    __syncthreads();
    // takes 13.629280 ms when count is assigned to item_cnt[threadIdx.x]
    // takes 0.431072 ms when zero is assigned to item_cnt[threadIdx.x]
    …
}[/codebox]
Kernel configuration: blockDim = dim3(128, 1), gridDim = dim3(8, 1).
The timings I got by commenting out the subsequent code show that assigning the automatic variable count to item_cnt[threadIdx.x] causes a terrible performance drop (13.629280 ms instead of 0.431072 ms). This should not happen, because count is just an automatic variable, and I don't think it should be placed in local memory. Even if it were placed in local memory, there should not be such a huge performance drop, because the accesses would be coalesced.
After more tests, I found that variables like count, which are modified very frequently, all have this problem: accessing them is very expensive.
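To narrow this down, the effect could be reduced to something like the sketch below. It is not my real kernel: count_kernel, time_kernel, run_variants and the all-zero input are made up for this test, and there is no error checking. One thing it might show is whether the fast measurement includes the counting loop at all. When count is never stored, the compiler is allowed to drop the whole loop as dead code, so the two timings may be comparing "loop plus store" against "no loop" rather than the cost of the store itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical reduced kernel (not the real item_Count): every thread counts
// how many elements of `data` equal its global index.
template <bool STORE>
__global__ void count_kernel(const int *data, int n, int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int count = 0;
    for (int i = 0; i < n; i++)
        if (data[i] == idx) count++;
    // Only this store makes `count` observable. Without it the compiler
    // is allowed to remove the loop above as dead code.
    if (STORE) out[idx] = count;
}

// Time one launch with CUDA events, using the same 8 x 128 configuration.
template <bool STORE>
float time_kernel(const int *d_data, int n, int *d_out)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    count_kernel<STORE><<<8, 128>>>(d_data, n, d_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Runs both variants and returns the count stored by thread 0.
int run_variants(float *ms_store, float *ms_no_store)
{
    const int n = 232 * 256;      // 232 iterations of 2 * blockDim.x elements
    const int threads = 8 * 128;
    int *d_data = 0, *d_out = 0;
    int first = -1;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMalloc((void **)&d_out, threads * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));   // all zeros: only thread 0 matches
    cudaMemset(d_out, 0, threads * sizeof(int));

    time_kernel<true>(d_data, n, d_out);      // warm-up launch, not reported
    *ms_store    = time_kernel<true>(d_data, n, d_out);
    *ms_no_store = time_kernel<false>(d_data, n, d_out);

    cudaMemcpy(&first, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return first;
}

int main()
{
    float ms_store = 0.0f, ms_no_store = 0.0f;
    int first = run_variants(&ms_store, &ms_no_store);
    printf("with store:    %f ms\n", ms_store);
    printf("without store: %f ms\n", ms_no_store);
    printf("count of thread 0: %d\n", first);
    return 0;
}
```

If the two times come out as far apart as in my measurements, the extra time probably belongs to the loop rather than to the store. Comparing the PTX generated for both variants (nvcc -ptx) should show whether the loop is still there in the variant without the store.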
It seems to be a problem with registers and automatic variables, but I can't figure it out, so I hope someone can give me some advice.
Thanks in advance