register pressure

I have a kernel that does a linear least-squares fit. It turns out the threads are using too many registers, so occupancy is low. Here is the kernel:

__global__
void strainAxialKernel(
    float* d_dis,
    float* d_str
){
    int i = threadIdx.x;
    float a = 0;
    float c = 0;
    float e = 0;
    float f = 0;
    int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
    int j;
    __shared__ float dis[WINDOW_PER_LINE];
    __shared__ float str[WINDOW_PER_LINE];

    // fetch data from global memory
    dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();

    // least square fit: accumulate sum(x), sum(x^2), sum(y), sum(x*y)
    for (j=-shift; j<NEIGHBOURS-shift; j++)
    {
        a += j;
        c += j*j;
        e += dis[i+j];
        f += (float)j*dis[i+j];
    }
    str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;

    // compensate attenuation
    if (COMPEN_EXP>0 && COMPEN_BASE>0)
    {
        str[i] = str[i]*powf((float)i/(float)COMPEN_BASE+1.0f, COMPEN_EXP);
    }

    // write back to global memory
    if (!SIGN_PRESERVE && str[i]<0)
    {
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];
    }
    else
    {
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];
    }
}
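For reference, with n = NEIGHBOURS samples the loop builds a = sum(j), c = sum(j^2), e = sum(dis[i+j]) and f = sum(j*dis[i+j]), so (a*e - n*f)/(a*a - n*c) is just the standard least-squares slope (n*sum(xy) - sum(x)*sum(y))/(n*sum(x^2) - sum(x)^2) with numerator and denominator both negated.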

I use 96 threads per block. On a GTS 250, each SM should be able to handle 8 blocks. Yet the Visual Profiler shows 11 registers per thread, and as a result occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is only 792 B, so registers are the limiting factor. The performance is not the end of the world; I am just curious whether there is any way I can get around this. Thanks.
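To make the question concrete, the kind of workaround I have in mind is capping register usage, either with nvcc's -maxrregcount flag or (if the toolkit supports it) a __launch_bounds__ qualifier on the kernel. A minimal, untested sketch; the 96/8 pair just restates my block size and the residency I would like, and the compiler may spill to local memory to honour it:

__global__ void
__launch_bounds__(96, 8)   // <= 96 threads per block, aim for 8 resident blocks per SM
strainAxialKernel(
    float* d_dis,
    float* d_str
){
    // body as above, unchanged
}

Or at compile time, e.g. nvcc -maxrregcount=10 (whether 10 is actually enough to fit 8 blocks depends on the register allocation granularity).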

I wouldn’t see 62.5% as a low occupancy; you will hardly see any improvement beyond that.
You can, however, get to 83.3% occupancy with 128 threads per block: at 96 threads a block is 3 warps, so the 5 resident blocks give 15 of the 24 warps an SM of that generation can hold (62.5%), while at 128 threads a block is 4 warps and 5 blocks give 20/24 (83.3%). The scheduler also slightly prefers an even number of warps, i.e. multiples of 64 threads per block.

BTW, I assumed each block uses 792 bytes of shared memory, not kilobytes.
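If you do go to 128 threads per block while keeping one line per block, the extra 32 threads need a guard, and they must still reach the barrier, because returning before __syncthreads() is undefined behaviour. A rough, unverified sketch, assuming WINDOW_PER_LINE stays at 96:

__global__
void strainAxialKernel(
    float* d_dis,
    float* d_str
){
    int i = threadIdx.x;
    bool active = (i < WINDOW_PER_LINE);   // threads 96..127 are padding
    __shared__ float dis[WINDOW_PER_LINE];
    __shared__ float str[WINDOW_PER_LINE];
    if (active)
        dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();                       // all 128 threads reach the barrier
    if (!active) return;                   // padding threads drop out here
    // ... rest of the kernel unchanged ...
}

launched as strainAxialKernel<<<numLines, 128>>>(d_dis, d_str), where numLines stands in for however many lines you process. Whether the idle warp costs more than the extra occupancy buys is something to measure.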

Oops, yeah, it is 792 bytes.
