Some Visual Profiler questions


I’ve just started programming on a GTX260 and I’m trying to use the Visual Profiler to measure performance.

I’ve figured out some of the parameters and profiler counters, but I still have trouble understanding some of them.

I’ve written the following, very simple matrix multiplication kernel (which doesn’t use shared memory):

__global__ void mulMatrixKernel(float* g_matrix_A, float* g_matrix_B, float* g_matrix_C, int rows, int cols)
  // access thread id
  const unsigned int row = blockIdx.y*TILE_DIM + threadIdx.y;
  const unsigned int col = blockIdx.x*TILE_DIM + threadIdx.x;
  float sum = 0.0f;

  // perform computation
  if (row < rows && col < cols)
    for (int i = 0; i < rows; i++)

With the visual profiler, I’m getting the following values:

1. Static shared memory per block: 36 bytes. I don’t use shared memory, so the only shared memory in use should be for the kernel parameters. But how are the 36 bytes distributed among the five parameters?
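
To make the arithmetic concrete, here is a small host-side sketch that just adds up the raw sizes of the five parameters. The helper names `param_bytes` and `unexplained_bytes` and the 4-byte pointer size (a 32-bit build) are my assumptions, and alignment padding is ignored. Under those assumptions the parameters account for only part of the 36 bytes.

```cpp
#include <cstddef>

// Hypothetical helper: total size of the kernel's argument list,
// three float* pointers plus two ints, ignoring any alignment padding.
// pointer_size is an assumption: 4 for a 32-bit build, 8 for a 64-bit build.
constexpr std::size_t param_bytes(std::size_t pointer_size)
{
    return 3 * pointer_size + 2 * sizeof(int);
}

// Bytes reported by the profiler that the parameters do not account for.
constexpr std::size_t unexplained_bytes(std::size_t reported, std::size_t pointer_size)
{
    return reported - param_bytes(pointer_size);
}
```

With 4-byte pointers and a 4-byte `int`, `param_bytes(4)` is 20 and `unexplained_bytes(36, 4)` is 16, so 16 bytes are still unaccounted for. With 8-byte pointers the parameters alone would already take 32 bytes.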

2. Registers per thread: 9. I can’t see more than three: row, col and sum.

In the summary table (View->Summary table) there is also a column called “instruction throughput”. It has no unit, it’s just a number (0.413 for me).

In the help file, I found the following explanation:

“This is the ratio of achieved instruction rate to peak single issue instruction rate. The achieved instruction rate is calculated using the “instructions” profiler counter. The peak instruction rate is calculated based on the GPU clock speed. In the case of instruction dual-issue coming into play, this ratio shoots up to greater than 1.”

Can someone explain this in other words?

Thanks a lot!