Are the number of registers used in a kernel and its performance related to the way local variables are used?

I have encountered a problem while optimizing my kernel. I found that the number of registers increases and the performance decreases when some local variables are used again after several other computations or memory operations that are unrelated to these variables.

For example, 100 variables are defined in one kernel. First, three of them, a, b, c, are assigned, e.g. a=d+f, b=e*f, c=g/h. Then some operations on other variables (not including a, b, c) are performed. Finally, a, b, c are used again and updated from their former values, e.g. a=a+s, b=b-k, c=c*h. I found that there is little effect on register count and performance if a, b, c are simply reassigned new values unrelated to their former values, e.g. a=1, b=2, c=k, whereas using the former values has a bad effect. Also, if the first use and the second use are adjacent, or there are fewer operations between the two, the performance and register count are almost unchanged.
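The pattern described above can be sketched roughly as follows. This is only an illustration: the kernel name, the USE_OLD_VALUES macro, the loop standing in for the unrelated work, and the file name pattern.cu are all hypothetical, and a kernel this small may well show no difference in register count. The point is only that in the first variant a, b, c must be kept alive across the unrelated work, while in the second variant their first values are never read again.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical reproduction of the pattern from the post above.
// Build both variants and compare the register counts reported by ptxas:
//   nvcc -O3 -Xptxas -v -DUSE_OLD_VALUES=1 pattern.cu
//   nvcc -O3 -Xptxas -v -DUSE_OLD_VALUES=0 pattern.cu
#ifndef USE_OLD_VALUES
#define USE_OLD_VALUES 1
#endif

__global__ void pattern(const double* __restrict__ in,
                        double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double d = in[i], f = d + 1.0, g = d + 2.0, h = d + 3.0;

    // First use: a, b, c are assigned.
    double a = d + f;
    double b = d * f;
    double c = g / h;

    // Unrelated work that does not touch a, b, c (stand-in for real code).
    double t = 0.0;
    for (int k = 0; k < 64; ++k) {
        t += sqrt(g + k) * h;
    }

#if USE_OLD_VALUES
    // Former values are read: a, b, c have to survive the loop above.
    a = a + t;  b = b - t;  c = c * h;
#else
    // Former values are overwritten: the first assignments are dead code.
    a = 1.0;    b = 2.0;    c = t;
#endif
    out[i] = a + b + c;
}

int main()
{
    const int n = 1 << 20;
    double *in, *out;
    cudaMallocManaged(&in,  n * sizeof(double));
    cudaMallocManaged(&out, n * sizeof(double));
    for (int i = 0; i < n; ++i) in[i] = 1.0 + i % 7;

    pattern<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```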

Here is another example: when 7 double local variables are updated from their former values after some other operations, the runtime goes up to 168 ms and the number of registers increases to 154, whereas when the 7 variables are updated without using their former values, the number of registers is 120 and the runtime is 112 ms.

I don’t know the reason. Could you please give me some guidance on this problem? Thank you very much.

The CUDA toolchain consists of two optimizing compilers: NVVM, which translates high-level code to the PTX intermediate format, and PTXAS, which compiles PTX into machine code (SASS) for the specified target architecture(s). PTXAS is the component responsible for instruction scheduling and register allocation. In general, PTXAS uses more registers only if it “believes” that this will improve performance in the trade-off with occupancy. Most of the time the choices are very good; occasionally the choices made are noticeably sub-optimal.

Both compilers consist of multiple phases that collectively apply numerous code transformations, controlled by a plethora of heuristics. Especially for big complicated kernels with many variables it is impossible to predict directly how source code changes will affect the machine code.

In terms of HLL-level code changes that can affect register use, there are: float vs double computation (a float variable requires one register, a double variable requires two registers), the use of certain math functions or math operations (e.g. divisions), relaxed math computation requirements (-use_fast_math may lead to a reduction in register usage), use of loop unrolling pragmas, function inlining attributes, and use of the __restrict__ modifier for pointer arguments.
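As a rough illustration of where several of these source-level knobs appear in code, here is a minimal sketch. The kernel, the helper function, the unroll factor, and the file name knobs.cu are made up for the example; whether any of these choices raises or lowers register use in a given kernel has to be measured.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper. __forceinline__ asks the compiler to inline it;
// __noinline__ would ask for the opposite.
__device__ __forceinline__ float scale(float x, float s)
{
    return x * s;
}

// __restrict__ promises that in and out do not alias, which gives the
// compiler more freedom to reorder loads and stores.
__global__ void knobs(const float* __restrict__ in,
                      float* __restrict__ out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // float accumulator: one 32-bit register; a double would need two.
    float acc = 0.0f;

    // Request partial unrolling by 4 (arbitrary factor for illustration).
    #pragma unroll 4
    for (int k = 0; k < 16; ++k) {
        int j = i + k * stride;
        if (j < n) acc += scale(in[j], s);
    }
    if (i < n) out[i] = acc;
}

// Compile-only check that also prints per-kernel register usage:
//   nvcc -O3 -Xptxas -v -c knobs.cu
// Adding -use_fast_math relaxes the math requirements for the whole file.
```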

You can also direct the compiler to reduce its register usage with the -maxrregcount compiler switch of nvcc or with the __launch_bounds__ function attribute. Be aware that their use often results in lower performance despite higher achieved occupancy (an indication that the compiler typically “knows best”).
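A minimal sketch of both mechanisms follows. The kernel name, the file name bounded.cu, and the specific numbers (256 threads per block, 2 blocks per multiprocessor, a cap of 64 registers) are assumptions chosen purely for illustration, not recommendations.

```cuda
#include <cuda_runtime.h>

// Per-kernel control: promise at most 256 threads per block and ask for at
// least 2 resident blocks per multiprocessor. The compiler may limit the
// register count to make that possible, spilling to local memory if needed.
// Launching this kernel with more than 256 threads per block will fail.
__global__ void __launch_bounds__(256, 2)
bounded_kernel(const double* __restrict__ in,
               double* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * in[i];
}

// Per-file control on the command line (applies to every kernel in the file):
//   nvcc -O3 -maxrregcount=64 -Xptxas -v -c bounded.cu
// The -Xptxas -v output reports registers used and any spill loads/stores,
// which is worth checking after tightening either limit.
```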

Before you start experimenting with compiler control mechanisms and brittle source code modifications, I would suggest using the CUDA profiler to help you pinpoint the performance bottleneck(s) in your code, so you have a better idea where to focus your efforts. Studying the CUDA Best Practices Guide is also highly recommended.

Many thanks for your information. I am sure the local variables defined in my kernel are held entirely in registers. I have taken into account that a double-precision variable consumes twice as many registers as a float. My kernel uses 49 double and 10 integer variables, so the total register count is close to 120 in the case where the 7 variables are updated without reading their original values, while the register count increases significantly (to 154) when these variables are updated by reading their former values (similar to a read followed by a store). It seems that such variables (those that must be read when they are updated) should not be used far away from the point where they are first initialized, or there will be a negative effect. Some people think the reason may be found in the assembly instructions, but I am not familiar with those.

Your suggestions are very useful. I am following them and hope I can fix this problem soon. Thank you.

Becoming acquainted with deciphering a machine code dump produced by cuobjdump --dump-sass is certainly a useful skill when trying to wring the last quantum of performance from CUDA code as a “ninja” programmer. But acquiring this skill takes a lot of practice and is hard and tedious work even for modest-sized kernels. Manually back-annotating a kernel with 100 machine instructions to determine cause and effect may take half a work day.

With large code and given the many code transformations known to the compiler it becomes virtually impossible to trace connections between source code and final machine code with any amount of certainty. There can also be interactions between optimizations, e.g. due to phase ordering.

I speak from personal experience here. For several years (while employed at NVIDIA) I worked very closely with the CUDA compiler team on both functional and performance issues. I would claim I dissected more SASS in detail during that time than any one person on the compiler team.

Thank you. So I will start with HLL-level optimizations only.