Hi!
Since the number of registers per thread can become a limiting factor for occupancy, I want to know how many registers my kernel uses.
I’ve read the programming guide and the best practices guide, and what I’ve understood from them is that all automatic variables are placed in registers (unless they consume too much memory, in which case they are placed in local memory).
So I have the following kernel (simple matrix multiplication kernel):
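__global__ void mulMatrixKernel(float* g_matrix_A, float* g_matrix_B, float* g_matrix_C, int rows, int cols)
{
    // row and column of the output element handled by this thread
    const unsigned int row = blockIdx.y * TILE_DIM + threadIdx.y;
    const unsigned int col = blockIdx.x * TILE_DIM + threadIdx.x;
    float sum = 0.0f;

    // perform computation; the guard also covers the store, so threads
    // outside the matrix do not write out of bounds
    if (row < rows && col < cols) {
        // A is indexed as rows x cols, so the inner dimension is cols
        // (B is assumed to be cols x cols)
        for (int i = 0; i < cols; i++)
            sum += g_matrix_A[row * cols + i] * g_matrix_B[i * cols + col];
        g_matrix_C[row * cols + col] = sum;
    }
}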
I would say that I have three registers here: row, col and sum. But according to the Visual Profiler the kernel uses 9. That is a big difference.
I have another question related to this: where are the intermediate results of computations stored (for example, blockIdx.y * TILE_DIM + threadIdx.y in the kernel above)? And one last related question: where are the variables threadIdx, blockIdx, blockDim and gridDim stored, and what is the latency of reading them?
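As far as I can tell, one way to cross-check the profiler's number is to ask the runtime directly with cudaFuncGetAttributes, which fills a cudaFuncAttributes struct with numRegs and localSizeBytes. Compiling with nvcc --ptxas-options=-v should print the same per-kernel register count at build time. Below is a minimal sketch of the runtime query. TILE_DIM = 16 is only a placeholder I picked for this example, and queryKernelRegisters is a helper name made up here, not part of any API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM 16  // assumed value, for illustration only

__global__ void mulMatrixKernel(float* g_matrix_A, float* g_matrix_B,
                                float* g_matrix_C, int rows, int cols)
{
    const unsigned int row = blockIdx.y * TILE_DIM + threadIdx.y;
    const unsigned int col = blockIdx.x * TILE_DIM + threadIdx.x;
    float sum = 0.0f;

    if (row < rows && col < cols) {
        for (int i = 0; i < cols; i++)
            sum += g_matrix_A[row * cols + i] * g_matrix_B[i * cols + col];
        g_matrix_C[row * cols + col] = sum;
    }
}

// Hypothetical helper: returns the number of registers per thread that the
// compiler assigned to mulMatrixKernel, or -1 if the query fails
// (for example, when no CUDA device is available).
int queryKernelRegisters()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, mulMatrixKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return attr.numRegs;
}

int main()
{
    // Exit code 0 if the query succeeded, 1 otherwise.
    return queryKernelRegisters() > 0 ? 0 : 1;
}
```

The number printed depends on the compiler version, the target architecture and the optimization flags, so it may not be exactly 9 on another setup. A non-zero localSizeBytes would indicate that something ended up in local memory.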