Registers per thread material

I’m having a somewhat hard time finding material about this. I want to understand more about how registers per thread work for each kernel. I have a kernel that uses 26 registers, but honestly, what does this mean and how can I optimize it? Does anyone have good literature on this?

Thank you.

I’d recommend reading the CUDA C Best Practices Guide, specifically the Registers section and the Occupancy Calculator section.

Rough overview: each streaming multiprocessor (SM) has a block of 32-bit registers, registersPerSm. This number depends on the compute capability of the device. When you compile, your kernel code uses some number of registers per thread, numRegisters. Since each thread needs its own registers, one block of threads will need registersPerBlock = numRegisters * numThreadsPerBlock. The maximum number of blocks of your kernel that one SM can handle is floor(registersPerSm / registersPerBlock). If that maximum number is 0, your kernel won’t launch.

Using the Occupancy Calculator can help show you how many registers you’d need to free to increase occupancy, i.e. how many blocks (and warps) of your kernel each streaming multiprocessor can keep resident.

In terms of how to optimize it, you may need to remove some local variables from your kernel, even if that means repeating some calculations. Or you can add extra {} to your code so that some local variables go out of scope sooner. It boils down to trial and error: is the cost of the extra operations offset by the increased occupancy?
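As a sketch of the scoping idea, here is a hypothetical kernel with an extra brace pair; keep in mind that the compiler often reuses registers on its own, so this may not change the reported register count at all:

```cuda
__global__ void scale_and_offset(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float result;
    {
        // Extra braces: tmp is only live inside this scope, hinting to
        // the compiler that its register can be reused afterwards.
        float tmp = in[i] * 2.0f;
        result = tmp + 1.0f;
    }
    out[i] = result;
}
```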

The compiler is very aggressive about optimization, so it is very difficult to alter the number of registers by modifying your source code. Removing local variables does not directly affect the number of registers because there is no guaranteed correspondence between C variables and registers. The same goes for using additional blocks to send variables out of scope. The compiler can tell when an intermediate value is no longer needed and will reuse the register before the variable goes out of scope. Sometimes you can help the compiler by reorganizing your code so that intermediate values don’t need to be kept for very long, but this is also extremely fickle.

nvcc does have the --maxrregcount N option, which limits the compiler to no more than N registers per thread. This usually forces it to spill intermediate values to local memory, but in some cases it can improve performance, if the local memory accesses are infrequent and you have a very serious occupancy problem.
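For reference, a typical invocation might look like this (the file name is a placeholder; -Xptxas -v makes ptxas report per-kernel register and spill usage, so you can see the effect of the cap):

```shell
# Report register/spill usage for the unconstrained build...
nvcc -Xptxas -v -c kernel.cu
# ...then recompile with a cap of 24 registers per thread.
nvcc -Xptxas -v --maxrregcount 24 -c kernel.cu
# The ptxas output includes lines like "Used N registers" and, if the
# cap forces spilling, "bytes spill stores" / "bytes spill loads".
```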

My advice is to treat manipulating register usage as an advanced topic that most CUDA programmers can safely ignore: something to worry about only when there is still time for extreme tweaking at the end of the development cycle, akin to CPU programmers worrying about omitting stack frames, generating leaf routines, or manipulating code layout during linking.

I would recommend looking at the __launch_bounds__ function attribute rather than the --maxrregcount compiler switch to influence register usage, as it allows control at kernel-level granularity rather than compilation-unit granularity.
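For example, the (hypothetically named) kernel below tells the compiler it will be launched with at most 256 threads per block and that at least 4 resident blocks per SM are desired, which bounds how many registers per thread the compiler may use for this kernel only:

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// the compiler limits register use so that 4 blocks of up to 256
// threads each can be resident on one SM.
__global__ void __launch_bounds__(256, 4)
my_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}
```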

Thanks guys, this helped me set my work on the right optimization path (i.e., not worrying about registers, but trying to parallelize more).