How do you calculate the number of registers per thread?

Hi!

As the number of registers per thread can become a limiting factor for occupancy, I want to know how many registers my kernel uses…

I’ve read the programming guide and the best practices guide, and what I’ve understood from them is that all automatic variables are placed in registers (unless they consume too much memory, in which case they are placed in local memory).

So I have the following kernel (simple matrix multiplication kernel):

__global__ void mulMatrixKernel(float* g_matrix_A, float* g_matrix_B, float* g_matrix_C, int rows, int cols)
{
  // global row and column of the output element handled by this thread
  const unsigned int row = blockIdx.y*TILE_DIM+threadIdx.y;
  const unsigned int col = blockIdx.x*TILE_DIM+threadIdx.x;
  float sum=0.0f;

  // perform computation; threads that fall outside the matrix do nothing
  if(row<rows && col<cols)
  {
    for(int i=0;i<rows;i++)
      sum+=g_matrix_A[row*cols+i]*g_matrix_B[i*cols+col];
    // store inside the bounds check so out-of-range threads never write
    g_matrix_C[row*cols+col]=sum;
  }
}

I would expect three registers here (row, col and sum), but according to the Visual Profiler the kernel uses 9. That is a big difference.

I have another question related to this problem: where are the intermediate results of computations stored (for example, blockIdx.y * TILE_DIM + threadIdx.y in the kernel above)? And one last related question: where are the variables threadIdx, blockIdx, blockDim and gridDim stored, and what is the latency for reading them?

The loop counter i and the pointers g_matrix_A + row*cols + i and g_matrix_B + i*cols + col will also be in registers, as will some intermediate results. My understanding is that variables like threadIdx are available in special-purpose registers.

If you really want to understand how registers are used in your kernel, use a disassembler such as decuda.

Vasily
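
For a quick check without a disassembler, the register count can also be read from the compiler or from the runtime. Compiling with nvcc --ptxas-options=-v prints the registers used by each kernel, and cudaFuncGetAttributes reports the same figure at run time. Below is a minimal sketch, assuming a TILE_DIM of 16 (the post does not give its value). threadsAllowedByRegisters is a hypothetical helper that ignores register allocation granularity, so its result is only a rough upper bound on resident threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM 16  // assumed value; the original post does not state it

// Kernel from the question, with the store kept inside the bounds check.
__global__ void mulMatrixKernel(const float* g_matrix_A, const float* g_matrix_B,
                                float* g_matrix_C, int rows, int cols)
{
    const unsigned int row = blockIdx.y * TILE_DIM + threadIdx.y;
    const unsigned int col = blockIdx.x * TILE_DIM + threadIdx.x;
    if (row < rows && col < cols) {
        float sum = 0.0f;
        for (int i = 0; i < rows; i++)
            sum += g_matrix_A[row * cols + i] * g_matrix_B[i * cols + col];
        g_matrix_C[row * cols + col] = sum;
    }
}

// Hypothetical helper: how many threads fit on one multiprocessor if
// registers were the only limit (allocation granularity is ignored).
int threadsAllowedByRegisters(int regsPerThread, int regsPerSM)
{
    return regsPerThread > 0 ? regsPerSM / regsPerThread : 0;
}

int main()
{
    // Ask the runtime what the compiler actually allocated for the kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, mulMatrixKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);

    // Compare against the register file of device 0.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per multiprocessor: %d\n", prop.regsPerMultiprocessor);
    printf("thread limit from registers alone: %d\n",
           threadsAllowedByRegisters(attr.numRegs, prop.regsPerMultiprocessor));
    return 0;
}
```

A non-zero local memory figure would suggest that some variables were spilled out of registers. The reported count will usually exceed the number of named variables, since addresses, the loop counter and intermediates need registers too.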
