How to determine the number of registers per thread

Hello everyone,

First of all, this is my first post, so hopefully I am posting in the right place, and sorry for my English, but I’m French …

So here’s my problem:

I’m working on a Tesla C2050 in double precision. Looking at the results in the CUDA profiler, I don’t understand how CUDA determines the number of registers per thread.

For example, here is a kernel (it is a little long, sorry):

__global__ void
FindIntersection (int modelId,
                  TModel *models,
                  int nRay,
                  int *panels,
                  double *thit,
                  double *icrapPtr,
                  double *tPtr,
                  double *nv)
{
  if (blockIdx.x >= nRay) return;
  if (threadIdx.x >= MAX_THREAD_PER_BLOCK_2) return;

  __shared__ int s_found;
  __shared__ double s_tmin[MAX_THREAD_PER_BLOCK_2];
  __shared__ int s_ipanel[MAX_THREAD_PER_BLOCK_2];
  s_found = 0;
  s_tmin[threadIdx.x] = MY_INFINITY;
  s_ipanel[threadIdx.x] = -1;

  __syncthreads();

  int i, delta;
  TModel *model = &models[modelId];
  double th_thit;

  delta = (model->nCube / blockDim.x) + 1;

  for (i = threadIdx.x * delta ; (i < threadIdx.x * delta + delta) && (i < model->nCube) ; i++)
    {
      th_thit = thit[i * MAX_NB_RAY + blockIdx.x];
      if (th_thit < s_tmin[threadIdx.x])
        {
          s_tmin[threadIdx.x] = th_thit;
          s_ipanel[threadIdx.x] = panels[i * MAX_NB_RAY + blockIdx.x];
          s_found = 1;
        }
    }

  __syncthreads();

  if (threadIdx.x == 0)
    {
      if (s_found)
        {
          th_thit = MY_INFINITY;
          for (i = 0 ; i < MAX_THREAD_PER_BLOCK_2 ; i++)
            {
              if (s_tmin[i] < th_thit)
                {
                  th_thit = s_tmin[i];
                  delta = s_ipanel[i];
                }
            }

          icrapPtr[blockIdx.x] = 1;
          tPtr[blockIdx.x] = th_thit;

          nv[3 * blockIdx.x] = model->triangles[delta * 12 + 7];
          nv[3 * blockIdx.x + 1] = model->triangles[delta * 12 + 8];
          nv[3 * blockIdx.x + 2] = model->triangles[delta * 12 + 9];
        }
      else
        {
          icrapPtr[blockIdx.x] = 0;
          tPtr[blockIdx.x] = -1;
        }
    }
}

So, from this code, how many registers do you think each thread uses?

Thank you

PS: The correct answer according to the CUDA profiler is 24 … BUT WHY???

Pass the -Xptxas="-v" option to nvcc, and the compiler will emit the exact register usage of the compiled kernel. It is impossible to estimate register usage from uncompiled C code: the compiler and assembler use very complex and aggressive optimization strategies that include code reordering, register re-use, spilling to local memory, dead code removal, result computation during compilation, function in-lining and a whole bunch of other stuff. You have asked a “how long is a piece of string?” question to which there is no answer other than to compile the code, and the code you posted won’t compile…
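As a complement to reading the ptxas report, the register count of an already compiled kernel can also be queried at run time with `cudaFuncGetAttributes`. The sketch below is illustrative only: the `scale` kernel, the helper `queryScaleAttributes` and the file name `regs.cu` are hypothetical stand-ins, not the kernel from this thread, and the numbers printed will depend on the target architecture and toolkit version.

```cuda
// Build (file name is hypothetical): nvcc -Xptxas -v regs.cu -o regs
// ptxas reports the per-kernel register usage while compiling.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical toy kernel, present only so there is something to inspect.
__global__ void scale(double *data, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Fills *attr with the resource usage of the compiled kernel.
// Returns false if the runtime reports an error (e.g. no usable device).
bool queryScaleAttributes(cudaFuncAttributes *attr)
{
    cudaError_t err = cudaFuncGetAttributes(attr, scale);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    cudaFuncAttributes attr;
    if (!queryScaleAttributes(&attr))
        return 1;

    printf("registers per thread : %d\n", attr.numRegs);
    printf("local memory (bytes) : %zu\n", attr.localSizeBytes);
    printf("static shared (bytes): %zu\n", attr.sharedSizeBytes);
    return 0;
}
```

A non-zero `localSizeBytes` indicates that the kernel uses per-thread local memory, which can be a sign of register spilling.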

Thank you for your answer.
So we cannot determine the number of registers used per thread because NVCC optimizes my code …

But in other functions, it uses 63 registers per thread (the maximum), and so it uses local memory to compensate for this limitation …
How can I decrease the number of registers by changing my code, to avoid using local memory?

Err no. The compiler option I gave you shows the exact number of registers the assembler used for your code. But when you say “determine” do you mean “control”? If you do, then there is the --maxrregcount option, which sets a hard limit on how many registers the assembler will try to use for the kernel. Usually it will result in more spilling to local memory (i.e. “there is no such thing as a free lunch”), but you can use the volatile keyword to provide hints to the compiler about what should be preferentially spilled to local memory. The optimal solution probably requires some benchmarking to see what is fastest overall. Fewer registers might give more occupancy, but that probably also requires more local memory transactions. The result may not be any faster than a version that uses more registers but less local memory.
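To put rough numbers on the occupancy side of this trade-off, here is a minimal sketch that estimates how many threads can be resident on one multiprocessor when registers are the only limit. The helper `registerLimitedThreads`, the file name `occupancy.cu` and the cap of 32 in the build line are hypothetical. The default figures (32768 registers and 1536 threads per multiprocessor) are the compute capability 2.0 limits of a C2050-class device. The estimate deliberately ignores register allocation granularity, shared memory and the per-multiprocessor block limit, so real occupancy can be lower.

```cuda
// Build with a register cap (file name and value are hypothetical):
//   nvcc --maxrregcount=32 -Xptxas -v occupancy.cu -o occupancy
#include <cstdio>
#include <algorithm>
#include <cuda_runtime.h>

// Upper bound on resident threads per SM imposed by register usage alone.
// Simplification: ignores allocation granularity, shared memory and the
// per-SM block limit.
int registerLimitedThreads(int regsPerThread, int regsPerSM, int maxThreadsPerSM)
{
    int byRegisters = regsPerSM / regsPerThread;
    return std::min(byRegisters, maxThreadsPerSM);
}

int main()
{
    // Defaults: compute capability 2.0 limits (Tesla C2050 class).
    int regsPerSM = 32768;
    int maxThreadsPerSM = 1536;

    // If a device is present, use its actual limits instead.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        regsPerSM = prop.regsPerMultiprocessor;
        maxThreadsPerSM = prop.maxThreadsPerMultiProcessor;
    }

    const int candidates[] = {24, 32, 63};
    for (int k = 0; k < 3; ++k)
        printf("%2d regs/thread -> at most %d resident threads per SM\n",
               candidates[k],
               registerLimitedThreads(candidates[k], regsPerSM, maxThreadsPerSM));
    return 0;
}
```

With the compute capability 2.0 figures, 24 registers per thread allow up to 1365 resident threads per multiprocessor, while 63 registers per thread allow only 520. Whether the extra occupancy outweighs the added local memory traffic still has to be measured, as the reply above says.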

I used -Xptxas="-v", but it does not work. The compiler does not emit any information about register usage. The host compiler is gcc-8.0.