Freeing of temporary registers

Hi all,

I am having trouble understanding how nvcc frees temporary registers, if at all. In the following code example, data is loaded from global memory into shared memory.

// Direction data

__shared__ int tSize;

__shared__ float doseWeight;

__shared__ float3 beamAxis;

__shared__ int3 startPos;

__shared__ int3 delta;

// Direction struct data

__shared__ float3 direction;

__shared__ float3 a;

__shared__ float3 aa;

__shared__ float3 b;

__shared__ float3 bb;

__shared__ float primaryNormal;

__shared__ float scatterNormal;

__shared__ float theta;

__shared__ float solidAngle;

// Only the first thread has to load and initialize the shared data

if ( threadIdx.x == 0 )

{  

  // Get the current direction index

  unsigned int di = blockIdx.x + directionsOffset;

  

  // Transfer all data for this direction from global to shared memory

  tSize         = data[di].tSize;
  doseWeight    = data[di].doseWeight;
  beamAxis      = data[di].beamAxis;
  startPos      = data[di].startPos;
  delta         = data[di].delta;
  direction     = data[di].direction;
  a             = data[di].a;
  aa            = data[di].aa;
  b             = data[di].b;
  bb            = data[di].bb;
  primaryNormal = data[di].primaryNormal;
  scatterNormal = data[di].scatterNormal;
  theta         = data[di].theta;
  solidAngle    = data[di].solidAngle;

}

Looking at the .ptx file, one notices that the compiler generates a ld.global and a st.shared instruction for every value, each using a temporary register, but it apparently does not reuse that register when loading the next value: the .cubin file reports that this code uses 30 registers. This severely limits the occupancy of our kernel.
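To illustrate the effect (this is only a sketch, not the actual PTX), the register allocation behaves as if every field were copied through its own temporary, with none of the temporaries reused for the next field:

  int    tmp_tSize      = data[di].tSize;        tSize      = tmp_tSize;
  float  tmp_doseWeight = data[di].doseWeight;   doseWeight = tmp_doseWeight;
  float3 tmp_beamAxis   = data[di].beamAxis;     beamAxis   = tmp_beamAxis;
  // ... and so on for the remaining eleven fields ...

Counting each float3/int3 field as three scalars, the fourteen fields expand to 30 scalar values, which lines up with the 30 registers reported in the cubin.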

However, changing the code to the following yields a register usage of only 16 registers.

// Only the first thread has to load and initialize the shared data

if ( threadIdx.x == 0 )

{  

  // Get the current direction index

  unsigned int di = blockIdx.x + directionsOffset;

  

  // Transfer all data for this direction from global to shared memory

  tSize      = data[di].tSize;
  doseWeight = data[di].doseWeight;
  beamAxis   = data[di].beamAxis;
  startPos   = data[di].startPos;
  delta      = data[di].delta;
  direction  = data[di].direction;

}

if ( threadIdx.x == 0 )

{  

  // Get the current direction index

  unsigned int di = blockIdx.x + directionsOffset;

  a             = data[di].a;
  aa            = data[di].aa;
  b             = data[di].b;
  bb            = data[di].bb;
  primaryNormal = data[di].primaryNormal;
  scatterNormal = data[di].scatterNormal;
  theta         = data[di].theta;
  solidAngle    = data[di].solidAngle;

}

Note: the second if statement is necessary here. Just limiting the scope with an extra set of braces did not change anything.

Is this a known problem, or is there something I could do to optimize my code other than the above “hack”? Any insights are appreciated, thanks.

PS: I am using the VS 7.0 compiler under WinXP SP2 and an 8800 GTX

I have noticed similar behavior. In one kernel I do a bilinear interpolation of data from shared memory, and I do this 8 times for different sets of values. If I write it as:

...

//interp1

//interp2

//interp3

//interp4

...

The number of registers scales linearly with the number of interpolations I do. If I write it as:

for (int i = 0; i < 8; i++) {
    if (i == 0)
        //interp1
    else if (i == 1)
        //interp2
    ...
}

The compiler seems to use far fewer registers.
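A more concrete sketch of that loop form (all the names here are invented for illustration; this is not the actual kernel): each iteration picks its set of values with a branch on the loop counter, so all eight interpolations share one set of temporaries.

__global__ void interpKernel(float2 pos, float *out)
{
    __shared__ float sets[8][4];   // four corner samples per interpolation
    // ... sets[][] would be filled from global memory here ...

    float fx = pos.x - floorf(pos.x);
    float fy = pos.y - floorf(pos.y);

    for (int i = 0; i < 8; i++) {
        const float *v;
        if (i == 0)      v = sets[0];   // values for interp1
        else if (i == 1) v = sets[1];   // values for interp2
        else             v = sets[i];   // ... remaining interpolations ...

        float top    = v[0] + fx * (v[1] - v[0]);
        float bottom = v[2] + fx * (v[3] - v[2]);
        out[blockIdx.x * 8 + i] = top + fy * (bottom - top);
    }
}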

We have also noticed that expressing code as a loop can prevent the compiler from chewing up registers unnecessarily. It’s an ugly hack to have to do though, and I hope we get better control over register use in future revs of the compiler, through optimization flags, pragmas, or similar methods.

John

Thanks for the replies. I have tried expressing some of our code as loops and it works for most of our cases. A compiler setting would certainly be appreciated, as would tools to analyse which piece of code is using the most registers; right now that is an exercise in the use of #if 0, and it is sometimes hard to tell what the compiler optimizes away when you are not using certain variables.

The .ptx file lists more registers than end up being used (compare the register count in the .ptx and .cubin files). Also, according to the documentation, register use may be optimized even further by the driver. An ugly (and coarse) way to check what happens is to try launching a kernel with varying numbers of threads. Once you get a failure, you have a bound on how many registers are actually being used.
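For example, a rough sketch of that probing approach (the kernel name is a placeholder, and the figure of 8192 registers per multiprocessor for the 8800 GTX is my assumption here):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelUnderTest(/* the kernel you want to measure */) { }

int main()
{
    // Launch single blocks of increasing size; the first failure bounds the
    // per-thread register count, since threads * registersPerThread has just
    // exceeded the ~8192 registers available on one multiprocessor.
    for (int threads = 64; threads <= 512; threads += 64) {
        kernelUnderTest<<<1, threads>>>();
        if (cudaGetLastError() != cudaSuccess) {
            printf("launch failed at %d threads: more than about %d registers per thread\n",
                   threads, 8192 / threads);
            return 0;
        }
        cudaThreadSynchronize();   // make sure the kernel actually ran
    }
    printf("no launch failure up to 512 threads\n");
    return 0;
}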

Paulius

I understood that the reg = comment in the .cubin file was accurate and that any further optimisation would not change register count.

At this point I would like to express my request that Nvidia change the definition of the .ptx file to make it much closer to the binary actually run on the device, so that when things go wrong one has some chance of working out what happened.

I have one example where adding a for loop inside a routine costs 10 registers, even though there is only one loop-invariant expression (apart from the termination condition) within the loop. As things stand, it is impossible to tell what has gone wrong and to attempt a workaround.
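Schematically, the situation is something like this (a made-up minimal example, not the actual routine): the body contains a single loop-invariant expression plus the termination condition, yet the loop still inflates the register count.

__global__ void invariantLoop(float *out, float a, float b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a * b;   // the one loop-invariant expression
}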

Could it be that the current pseudo-code definition is more about where Nvidia would like to quarantine the open-source component of the compiler than about what is best for us?

A flag like -r10 for nvcc, to stop it applying common-subexpression elimination and other tricks once they push register use past the peak your code actually requires, is sorely needed. If the compiler still goes over your limit, you know it is the way you have structured your code, not its blind optimisation. The compiler, and especially the loader, should NOT automatically spill registers out to device memory, as it cannot do the coalescing and other planning required for efficient use of device memory. That bit in section 4.2.2.4 about automatic movement of registers should be deleted and the tools changed, for the same reasons that local memory is being deprecated. If the kernel does not fit, the launch should simply fail (with a more useful diagnostic!).

Anyone who agrees with the request for an assembler file closer to the loaded code image, please post your vote here! I would hate to have to disassemble the .cubin without any documentation.

Thanks, Eric