Understanding Register Count in Visual Profiler

Hello Everyone,

I’m trying to understand what I am seeing in the Visual Profiler regarding the number of registers per thread. For my kernel, it says 32 under “registers per thread”. Does that mean that each thread uses 32 registers? What contributes to this? I certainly did not write “int x = …” 32 times or anything like that, so I can’t say that I understand where this is coming from. I have a setup kernel that uses (as I see it) 4 registers, but the profiler says 15.

I will just show the setup kernel code:

[codebox]__global__ void setup(double*     d_values,
                      double**    d_valuesPtr,
                      double*     d_ntimes,
                      double**    d_ntimesPtr,
                      TimeStamp*  d_times,
                      TimeStamp** d_timesPtr,
                      ulong*      d_tiles,
                      ulong**     d_tilesPtr,
                      ulong*      d_offsetArray,
                      ulong*      d_valuesLengthRow,
                      double*     d_c,
                      double**    d_cPtr,
                      double*     d_p,
                      double**    d_pPtr,
                      double*     d_exp_deltas,
                      double**    d_exp_deltasPtr)
{
    int i = blockIdx.x;  // each block handles one row

    ulong lenRow = d_valuesLengthRow[i];

    // Elements per thread: lenRow / blockDim.x, rounded up.
    ulong slice = lenRow % (unsigned)blockDim.x;
    slice = (slice == 0) ? (lenRow / (unsigned)blockDim.x) : ((lenRow / (unsigned)blockDim.x) + 1);

    // Point each row pointer at its start inside the flat 1D arrays.
    d_valuesPtr[i]     = &d_values[d_offsetArray[i]];
    d_ntimesPtr[i]     = &d_ntimes[d_offsetArray[i]];
    d_timesPtr[i]      = &d_times [d_offsetArray[i]];
    d_cPtr[i]          = &d_c[i * gridDim.x];
    d_pPtr[i]          = &d_p[i * gridDim.x];
    d_exp_deltasPtr[i] = &d_exp_deltas[d_offsetArray[i]];
    d_tilesPtr[i]      = &d_tiles[i * (NUM_TILES + 1)];

    int n;
    int boundary = (1 + ((threadIdx.x + 1) * slice));

    // Each thread fills its own slice of exp(t[n-1] - t[n]).
    if (lenRow > 0)
        for (n = 1 + threadIdx.x * slice; n < boundary && n < lenRow; ++n)
            d_exp_deltasPtr[i][n] = exp(d_ntimesPtr[i][n-1] - d_ntimesPtr[i][n]);
}[/codebox]

I am in the process of changing the way I allocate my multidimensional arrays. At the moment I allocate a 1D array and keep a second array of pointers into the correct spots, which is generally not a good idea (even though changing it gives my kernel almost no speedup). I call the kernel like so:

[codebox]setup<<<numVars,gpuMaxThreads>>>[/codebox]

In my kernel functions, I have cut out nearly half of the variables I declare, but the profiler still reports “32 registers per thread”. :/

For my kernel, I count 12 registers used (profiler says 32). For the setup, I count 5 (profiler says 15). I replaced every “i” with “blockIdx.x”, which should cut the register usage by one, but the profiler's number did not change.

Low-level registers and high-level C variables do not have a one-to-one mapping. Register allocation is done by ptxas and is a non-trivial procedure, so it is very difficult to estimate how many registers a kernel needs just by looking at the C source code. Subexpressions might require multiple registers to evaluate, and the compiler can decide to keep the result of an expression in a register to avoid recalculating it later. Moreover, the same register can be used for two different variables if their uses do not overlap in the code. Trivial changes, like eliminating an intermediate variable in your source, tend not to have any effect on register usage, since the compiler was already doing that for you.
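
Rather than counting variables by hand, you can ask the runtime what ptxas actually allocated. Here is a minimal sketch, assuming a toy kernel `scale` and a helper `reportRegisters` (both hypothetical names, not from the original post). `cudaFuncGetAttributes` and the `numRegs` field are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical toy kernel, standing in for the real one.
__global__ void scale(double* data, double factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

// Returns the number of registers per thread used by the compiled kernel,
// or -1 if the query fails (for example, when no CUDA device is present).
int reportRegisters()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return -1;
    }
    std::printf("scale: %d registers per thread\n", attr.numRegs);
    return attr.numRegs;
}
```

The same number is printed at compile time if you pass `--ptxas-options=-v` to nvcc, which is handy for seeing whether a source change made any difference at all.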

Oh I see. Well, hmmm, then I am a bit at a loss, haha.

If you need to reduce the register count, you can “encourage” the compiler and assembler to use fewer registers with nvcc's --maxrregcount option. Note that this can slow down your kernel: if there is no way to do the calculation within the register limit you set, the compiler will start to spill intermediate values to local memory. Local memory is short for “thread-local memory” and is actually stored in global memory, which is much slower than registers.
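
A per-kernel alternative to the file-wide --maxrregcount flag is the `__launch_bounds__` qualifier, which lets the compiler work out a register budget from the block size you promise to launch with. The sketch below is only an illustration: the kernel `deltas`, the bound of 256 threads, and the helper `spilledBytes` are assumptions, not the original poster's code. It then reads `localSizeBytes`, which reports how much local memory each thread uses, so you can see whether a limit caused spilling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed maximum block size for this illustration.
#define MAX_THREADS_PER_BLOCK 256

// __launch_bounds__ tells the compiler this kernel is never launched with
// more than MAX_THREADS_PER_BLOCK threads per block, so it can budget
// registers accordingly.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK)
deltas(const double* t, double* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= 1 && idx < n)
        out[idx] = exp(t[idx - 1] - t[idx]);
}

// Returns the per-thread local memory in bytes (0 means no local memory
// is used, so nothing was spilled), or -1 if the query fails.
long spilledBytes()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, deltas) != cudaSuccess)
        return -1;
    std::printf("deltas: %d registers, %zu bytes local memory per thread\n",
                attr.numRegs, attr.localSizeBytes);
    return (long)attr.localSizeBytes;
}
```

If the local memory figure jumps from zero after you tighten the limit, the kernel has started spilling and may well run slower than before.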

There are also tricks to convince the register allocator to use fewer registers:

http://forums.nvidia.com/index.php?showtopic=168974

Another way to reduce register count is to allocate some variables in shared memory.
Shared memory is about as fast as registers when there are no bank conflicts, and there’s no need to synchronize as long as each thread only touches its own slots.

N.

But doesn’t the pointer to the shared memory use a register per thread, making it pointless?

Not if you store more than one variable in shared memory.

N.
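
As a rough illustration of the shared memory idea (not the original poster's code), the sketch below keeps two per-thread intermediates in statically sized shared arrays instead of local variables. The kernel `sumAndDiff`, the block size of 128, and the host wrapper `runSumAndDiff` are hypothetical. Each thread touches only its own slot, so no `__syncthreads()` is needed. Whether this actually lowers the reported register count is up to the compiler, so check the number before and after.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128  // assumed block size for this sketch

__global__ void sumAndDiff(const float* a, const float* b, float* out, int n)
{
    // One slot per thread for each intermediate; both arrays are
    // addressed with the same threadIdx.x offset.
    __shared__ float s_sum[BLOCK];
    __shared__ float s_diff[BLOCK];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        s_sum[threadIdx.x]  = a[idx] + b[idx];
        s_diff[threadIdx.x] = a[idx] - b[idx];
        // Each thread reads back only what it wrote itself.
        out[idx] = s_sum[threadIdx.x] * s_diff[threadIdx.x];
    }
}

// Host wrapper: computes (a[i] + b[i]) * (a[i] - b[i]) on the GPU.
// Returns an empty vector if any CUDA call fails.
std::vector<float> runSumAndDiff(const std::vector<float>& a,
                                 const std::vector<float>& b)
{
    int n = (int)a.size();
    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    std::vector<float> out(n);

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sumAndDiff<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_a, d_b, d_out, n);
        ok = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    if (!ok)
        out.clear();
    return out;
}
```

Keep in mind that shared memory is a per-block resource, so every slot you add here is multiplied by the block size and can limit how many blocks fit on a multiprocessor.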

Would the high register number also be due to the fact that I am using 64-bit values? To store a double, it needs two 32-bit registers. So when I am seeing 15 registers, it is really like I am using at least 30 registers right there. Does that make sense?

You may want to check out section 5.2.3 of the programming guide.

N.

64-bit values do increase the register usage, but I believe that the reported register number is always in units of 32-bit registers (so a 64-bit value counts as 2).