I have a kernel does a linear least square fit. It turns out threads are using too many registers, therefore, the occupancy is low. Here is the kernel,
__global__
void strainAxialKernel(
float* d_dis,
float* d_str
){
int i = threadIdx.x;
float a = 0;
float c = 0;
float e = 0;
float f = 0;
int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
int j;
__shared__ float dis[WINDOW_PER_LINE];
__shared__ float str[WINDOW_PER_LINE];
// fetch data from global memory
dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
__syncthreads();
// least square fit
for (j=-shift; j<NEIGHBOURS-shift; j++)
{
a += j;
c += j*j;
e += dis[i+j];
f += (float(j))*dis[i+j];
}
str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;
// compensate attenuation
if (COMPEN_EXP>0 && COMPEN_BASE>0)
{
str[i]
= (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));
}
// write back to global memory
if (!SIGN_PRESERVE && str[i]<0)
{
d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];
}
else
{
d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];
}
}
I use 96 threads per block. On GTS 250, the SM shall be able to handle 8 blocks. Yet, visual profiler shows I have 11 registers per thread, as a result, occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so the register is the problem. The performance is not end of the world. I am just curious if there is anyway I can get around this. Thanks.