# How do you calculate the number of registers per thread?

Hi!

As the number of registers per thread can become a limiting factor for occupancy, I want to know how many registers my kernel uses...

I've read the programming guide and the best practices guide, and what I've understood from them is that all automatic variables are placed in registers (unless they consume too much memory, in which case they are placed in local memory).

So I have the following kernel (simple matrix multiplication kernel):

```
__global__ void mulMatrixKernel(float* g_matrix_A, float* g_matrix_B, float* g_matrix_C, int rows, int cols)
{
    const unsigned int row = blockIdx.y * TILE_DIM + threadIdx.y;
    const unsigned int col = blockIdx.x * TILE_DIM + threadIdx.x;
    float sum = 0.0f;

    // perform computation
    if (row < rows && col < cols)
    {
        for (int i = 0; i < rows; i++)
            sum += g_matrix_A[row * cols + i] * g_matrix_B[i * cols + col];

        // store only from threads that map to a valid element
        g_matrix_C[row * cols + col] = sum;
    }
}
```

I would say that I have three registers here: row, col, sum. But according to the visual profiler I have 9. Now that is a big difference.

I have another question related to this problem: where are the intermediate results of computations stored (for example, blockIdx.y * TILE_DIM + threadIdx.y in the kernel above)? And one last related question: where are the variables threadIdx, blockIdx, blockDim and gridDim stored, and what is the latency for reading them?

Loop counter `i` and pointers `g_matrix_A + row*cols + i` and `g_matrix_B + i*cols + col` will also be in registers. Some intermediates too. My understanding is that variables like `threadIdx` are available in special-purpose registers.
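To make that concrete, here is a hand-written sketch of the same kernel with the hidden temporaries spelled out as named variables. It is only an illustration of which values plausibly need a register at the same time, not what the compiler actually emits. The name `mulMatrixKernelExplicit`, the temporaries `a`, `b`, `a_val`, `b_val`, and the value of `TILE_DIM` are assumptions made for this example, and the store is guarded by the bounds check here.

```cuda
#define TILE_DIM 16  // assumed tile size; the original post does not show its value

// Illustration only: the real register allocation is decided by the compiler
// and can differ from what this source-level rewrite suggests.
__global__ void mulMatrixKernelExplicit(const float* g_matrix_A,
                                        const float* g_matrix_B,
                                        float* g_matrix_C,
                                        int rows, int cols)
{
    const unsigned int row = blockIdx.y * TILE_DIM + threadIdx.y;  // named in the original
    const unsigned int col = blockIdx.x * TILE_DIM + threadIdx.x;  // named in the original
    float sum = 0.0f;                                              // named in the original

    if (row < rows && col < cols) {
        const float* a = g_matrix_A + row * cols;  // running address into a row of A
        const float* b = g_matrix_B + col;         // running address into a column of B

        for (int i = 0; i < rows; i++) {           // loop counter
            float a_val = *a;                      // operand loaded from global memory
            float b_val = *b;                      // operand loaded from global memory
            sum += a_val * b_val;
            a += 1;                                // next element in the row of A
            b += cols;                             // next element in the column of B
        }

        g_matrix_C[row * cols + col] = sum;        // address computation is another temporary
    }
}
```

Counting the values that are live inside the loop (`sum`, `a`, `b`, `i`, `a_val`, `b_val`, plus `rows` and `cols`) already gives well over three, so a profiler figure above the number of declared variables is not surprising.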

If you really want to understand how registers are used in your kernel, use a disassembler such as decuda.
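If you only need the register count rather than the full disassembly, the CUDA runtime can report it. The following is a minimal sketch, assuming a toolkit that provides `cudaFuncGetAttributes`; the value of `TILE_DIM` and the file name `regs.cu` are assumptions for this example, and the kernel is the one from the question with the store guarded by the bounds check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM 16  // assumed tile size; not given in the original post

__global__ void mulMatrixKernel(float* g_matrix_A, float* g_matrix_B,
                                float* g_matrix_C, int rows, int cols)
{
    const unsigned int row = blockIdx.y * TILE_DIM + threadIdx.y;
    const unsigned int col = blockIdx.x * TILE_DIM + threadIdx.x;
    float sum = 0.0f;

    if (row < rows && col < cols) {
        for (int i = 0; i < rows; i++)
            sum += g_matrix_A[row * cols + i] * g_matrix_B[i * cols + col];
        g_matrix_C[row * cols + col] = sum;
    }
}

int main()
{
    // Ask the runtime for the resource usage of the compiled kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, mulMatrixKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:           %d\n", attr.numRegs);
    printf("local memory per thread:        %zu bytes\n", attr.localSizeBytes);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```

The same number is printed at compile time if you build with `nvcc -Xptxas -v regs.cu -o regs`, which makes `ptxas` report the registers used by each kernel. The count depends on the target architecture and compiler version, so it may not match the 9 reported in the question.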

Vasily
