Registers per thread material

zarnick · November 13, 2012, 9:57am

I’m having a somewhat hard time finding material about this. I want to understand better how registers per thread work for each kernel. I have a kernel that uses 26 registers, but honestly, what does this mean and how can I optimize it? Does anyone have good literature on this?

Thank you.

thearchermdd · November 13, 2012, 4:00pm

I’d recommend reading the CUDA C Best Practices Guide, specifically the Registers section and the Occupancy Calculator section.

Rough overview: each streaming multiprocessor (SM) has a block of 32-bit registers, registersPerSm. This number depends on the compute capability of the device. When you compile, your kernel code uses some number of registers per thread, numRegisters. Since each thread needs its own registers, one block of threads will need registersPerBlock = numRegisters * numThreadsPerBlock. As far as registers are concerned, the maximum number of blocks of your kernel one SM can handle is floor(registersPerSm / registersPerBlock); other limits, such as shared memory and the maximum number of threads per SM, can lower it further. If that maximum number is 0, your kernel won’t launch.
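To make that arithmetic concrete, here is a minimal sketch of the formula in plain host-side C++. The function name is made up for illustration, and it deliberately ignores register allocation granularity and every non-register limit, so treat the result as an upper bound only.

```cpp
// Upper bound on the number of blocks one SM can hold, looking at
// register usage alone (hypothetical helper, not a CUDA API).
// Assumes numRegisters > 0 and numThreadsPerBlock > 0.
int maxBlocksPerSmByRegisters(int registersPerSm,
                              int numRegisters,
                              int numThreadsPerBlock) {
    // Every thread in the block needs its own copy of the registers.
    int registersPerBlock = numRegisters * numThreadsPerBlock;
    // Integer division of positive values is the floor from the formula.
    return registersPerSm / registersPerBlock;
}
```

Assuming a device with 32768 registers per SM (the figure for compute capability 2.x) and the 26 registers from the question, this gives 4 blocks per SM at 256 threads per block, and 1 block at 1024 threads per block.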

Using the Occupancy Calculator can help show you how many registers you’d need to free to increase occupancy, which is the ratio of warps resident on a streaming multiprocessor to the maximum number of warps it supports. The more blocks of your kernel an SM can load, the higher the occupancy.
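As a rough illustration of what the calculator reports, occupancy can be estimated from the number of resident blocks like this. This is only a sketch: the function name is hypothetical, the warp size of 32 is hard-coded, and the real calculator also accounts for shared memory and allocation granularity.

```cpp
// Estimated occupancy: resident warps divided by the maximum number of
// warps the SM supports (hypothetical helper, not a CUDA API).
double estimateOccupancy(int residentBlocksPerSm,
                         int numThreadsPerBlock,
                         int maxWarpsPerSm) {
    const int warpSize = 32;
    // A partially filled warp still occupies a whole warp slot.
    int warpsPerBlock = (numThreadsPerBlock + warpSize - 1) / warpSize;
    int residentWarps = residentBlocksPerSm * warpsPerBlock;
    // The hardware cannot hold more warps than its limit.
    if (residentWarps > maxWarpsPerSm) {
        residentWarps = maxWarpsPerSm;
    }
    return static_cast<double>(residentWarps) / maxWarpsPerSm;
}
```

For example, assuming an SM that supports 48 warps (as on compute capability 2.x), 4 resident blocks of 256 threads give 32 of 48 warps, or about 67% occupancy; getting to 6 resident blocks would reach 100%.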

In terms of how to optimize it, you may need to remove some local variables from your kernel, even if this means repeating some calculations. Or you can add extra {} to your code so that some local variables go out of scope. It boils down to trial and error: is the increased number of operations offset by increasing occupancy?

seibert · November 17, 2012, 4:09pm

The compiler is very aggressive about optimization, so it is very difficult to alter the number of registers by modifying your source code. Removing local variables does not directly affect the number of registers because there is no guaranteed correspondence between C variables and registers. The same goes for using additional {} blocks to send variables out of scope. The compiler can tell when an intermediate value is no longer needed and will reuse the register before the variable goes out of scope. Sometimes you can help the compiler by reorganizing your code so that intermediate values don’t need to be kept for very long, but this is also extremely fickle.

nvcc does have the --maxrregcount N option, which will limit the compiler to use no more than N registers. This usually forces it to spill intermediate values to local memory, but in some cases this can improve performance if the local memory access is infrequent and you have a very serious occupancy problem.

njuffa · November 17, 2012, 9:17pm

My advice is to treat manipulating register usage as an advanced topic that most CUDA programmers can safely ignore. It is something to worry about when there is still time for extreme tweaking at the end of the development cycle, akin to CPU programmers worrying about the omission of stack frames, generation of leaf routines, or manipulating code layout during linking.

I would recommend looking at the __launch_bounds__ function attribute rather than the -maxrregcount compiler switch to influence register usage, as this allows control with kernel-level granularity rather than compilation-unit granularity.
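To give a feel for how launch bounds relate to register usage: the compiler derives a per-thread register limit from the bounds so that the requested number of blocks can be resident on one SM. The sketch below approximates that limit in plain host-side C++. The helper name and the kernel name in the comment are hypothetical, and the real compiler also accounts for allocation granularity and the architecture's per-thread maximum.

```cpp
// In CUDA source, the attribute is attached to the kernel definition:
//
//   __global__ void __launch_bounds__(256, 4) myKernel(float *data) { ... }
//
// where 256 is maxThreadsPerBlock and 4 is minBlocksPerMultiprocessor
// (myKernel is a placeholder name).

// Approximate per-thread register budget implied by such bounds
// (hypothetical helper, not a CUDA API). Assumes positive arguments.
int registerBudgetPerThread(int registersPerSm,
                            int maxThreadsPerBlock,
                            int minBlocksPerSm) {
    // All requested blocks must fit in the SM's register file at once.
    int threadsThatMustFit = maxThreadsPerBlock * minBlocksPerSm;
    return registersPerSm / threadsThatMustFit;
}
```

Assuming 32768 registers per SM, bounds of (256, 4) leave a budget of 32 registers per thread, which a 26-register kernel already meets. Bounds of (256, 6) leave only 21, so the compiler would have to trim register usage, possibly by spilling to local memory.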

zarnick · November 19, 2012, 10:57am

Thanks guys, this helped me set my work on the right optimization path (i.e. not worrying about registers, but trying to parallelize more).
