I have a kernel that accesses global memory in a for loop, followed by a data-dependent, divergent while loop.
[codebox]__global__ void my_kernel(int *g_data,int rows,int cols,int *g_results,…)
{
    int i=0,j;
    extern __shared__ int s_data[];   // dynamically sized shared memory
    int word;
    int res=0;
    const int tid=blockDim.x*blockIdx.x+threadIdx.x;   // global thread index
    /* some code to assign the s_data */
    __syncthreads();
    for(j=0;j<rows;++j)
    {
        //__syncthreads();
        word=g_data[j*cols+tid];   // global memory read
        // data-dependent loop; check the bound before reading s_data[i]
        while(i<cols&&s_data[i]<word)
        {
            ++i;
        }
        if(i<cols&&s_data[i]==word)
            res+=s_data[i];
        //__syncthreads();
    }
    g_results[tid]=res;
}[/codebox]
Is the global memory read in the for loop a coalesced access? I have tried adding __syncthreads() inside the for loop, but the performance doesn’t improve.
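To make the question concrete, here is a small host-side sketch (plain C++, no GPU needed) of how I understand the read pattern for one warp at a fixed j. The helper names warp_read_indices and is_contiguous are made up for illustration and are not part of my kernel, and the warp size of 32 is an assumption. It only models the index arithmetic j*cols+tid, not alignment or the actual hardware rules.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the kernel): the element indices into
// g_data that the threads of one warp read for a fixed row j.
// Assumes 32 threads per warp and the indexing j*cols + tid from the kernel.
std::vector<std::size_t> warp_read_indices(std::size_t j, std::size_t cols,
                                           std::size_t first_tid,
                                           std::size_t warp_size = 32) {
    std::vector<std::size_t> idx;
    idx.reserve(warp_size);
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
        idx.push_back(j * cols + first_tid + lane);
    }
    return idx;
}

// True when every thread's index is exactly one past its neighbour's,
// i.e. the warp reads one contiguous run of ints.
bool is_contiguous(const std::vector<std::size_t>& idx) {
    for (std::size_t k = 1; k < idx.size(); ++k) {
        if (idx[k] != idx[k - 1] + 1) return false;
    }
    return true;
}
```

For example, is_contiguous(warp_read_indices(3, 1024, 64)) checks the warp that starts at tid 64 in row 3. If this model is right, neighbouring threads read neighbouring ints for each j, but I am not sure whether that alone is enough to count as coalesced, since it may also depend on how cols and the base address are aligned.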
I am new to CUDA programming, so any suggestions for my code are greatly appreciated.
Thanks