Does wrapping registers in code blocks (curly braces) allow them to be freed from register memory?

I am trying to keep my kernel to no more than 32 registers per thread on my compute capability 6.1 card in order to maximize occupancy. My kernel is heavily bottlenecked by global memory. If variables that are only used mid-kernel are wrapped in a code block (curly braces), will their registers be freed once they go out of scope?

I’ve posted an example below:

... previous code ...

    {
        float x = sin(a[idx] + arctan(b[idx] / c[idx]));
        float y = sin(d[idx] + arctan(e[idx] / f[idx]));

        z[idx] = x + y;
        w[idx] = y - x;
    }

... code continues ...

In summary, can the registers holding those two variables now be reused, so that I don't go past my target of 32 registers per thread? Are there any other optimization tricks I should be aware of?

Thank You
Mick

The compiler should be able to recognize that those variables go out of scope, of course.

However, the compiler should be able to make the appropriate register optimizations in this case without you taking the special step of creating a new scope: register allocation is driven by where a value is last used, not by where its enclosing block ends. If you find a well-defined counterexample, I suggest filing it as a bug so the compiler can be improved.

General statements about register utilization with respect to a specific level (e.g. 32 registers) really can’t be made, IMO. You need a specific, complete code example to assess whether a change has any impact on register pressure/usage.
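A minimal sketch of what such a complete test case could look like. The file name, the kernel names `scoped` and `unscoped`, and the bounds parameter `n` are hypothetical, and `atan2f` is used as a stand-in for the question's `arctan` (an assumption about the intent). The two kernels differ only in the extra pair of braces, so compiling them together lets you compare the register counts that ptxas reports for each.

```cuda
// Hypothetical file name: regs_scope.cu
// Compile only, with per-kernel resource reporting from ptxas
// (sm_61 matches the compute capability 6.1 card in the question):
//   nvcc -arch=sm_61 -Xptxas -v -c regs_scope.cu
// ptxas prints one "Used N registers" line per kernel.
#include <cuda_runtime.h>

// Variant 1: temporaries confined to an inner block, as in the question.
__global__ void scoped(const float *a, const float *b, const float *c,
                       const float *d, const float *e, const float *f,
                       float *z, float *w, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    {
        float x = sinf(a[idx] + atan2f(b[idx], c[idx]));
        float y = sinf(d[idx] + atan2f(e[idx], f[idx]));
        z[idx] = x + y;
        w[idx] = y - x;
    }
}

// Variant 2: identical arithmetic, temporaries declared at function scope.
__global__ void unscoped(const float *a, const float *b, const float *c,
                         const float *d, const float *e, const float *f,
                         float *z, float *w, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float x = sinf(a[idx] + atan2f(b[idx], c[idx]));
    float y = sinf(d[idx] + atan2f(e[idx], f[idx]));
    z[idx] = x + y;
    w[idx] = y - x;
}
```

If the two reported counts differ for a real kernel, that pair of kernels is the kind of well-defined example worth attaching to a bug report.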

Ok. Thank you. That’s what I thought, but I wasn’t completely sure.

I don’t have a counter example. Is there a way to see the maximum number of registers my kernel is using without manually counting?

There are at least 2 ways:

  • at compile time, you can pass the -Xptxas -v option to the compiler (nvcc) and it will spit out various additional information about resource utilization determined by ptxas. One of the output items is registers used (assigned by the compiler).

  • at run time, you can use the various profilers which will give you this information. Since register allocation by the hardware is done with a particular granularity, the report at compile time and run time may differ by a few registers.
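Besides the two routes above, the CUDA runtime API can also report the register count from inside the program via `cudaFuncGetAttributes`, whose `cudaFuncAttributes` result has a `numRegs` field. The sketch below is an illustration only: the file name, the kernel `demo`, and the helper `registersPerThread` are hypothetical placeholders for your own code.

```cuda
// Hypothetical file name: query_regs.cu
// Build (also prints the compile-time figure from ptxas):
//   nvcc -arch=sm_61 -Xptxas -v query_regs.cu -o query_regs
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel under study.
__global__ void demo(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = sinf(in[idx]);
}

// Returns the per-thread register count reported by the runtime,
// or -1 if the query fails (e.g. no usable device).
int registersPerThread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demo);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}

int main()
{
    printf("demo uses %d registers per thread\n", registersPerThread());
    return 0;
}
```

Comparing this value with the `Used N registers` line from `-Xptxas -v` is a quick sanity check that you are looking at the kernel you think you are.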

If you want to cut down on the number of registers used in this particular piece of code, research whether you can replace the use of trig functions with algebraic computation. The code suggests that you are applying some sort of rotation in two-dimensional space, so I would look into the applicability of a rotation matrix by which the 2D points are multiplied. Or maybe a more general transformation matrix (https://en.wikipedia.org/wiki/Transformation_matrix) is applicable? If you don’t need high accuracy you could also look into use of the __sinf() intrinsic device function, or try compiling the code with -use_fast_math.

I further note that there is no function ‘arctan’ in CUDA, and that the usage of ‘arctan’ here suggests that what you really want is the standard function ‘atan2’.
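As one illustration of trading a trig call for algebra, and assuming the intended expression is `sin(a + atan2(b, c))`, the angle-sum identity removes the `atan2` call entirely: with r = sqrt(b² + c²), cos(atan2(b, c)) = c/r and sin(atan2(b, c)) = b/r, so sin(a + atan2(b, c)) = (sin(a)·c + cos(a)·b)/r. The helper name `sinPlusAtan2`, the kernel name `transform`, and the parameter `n` below are hypothetical. Whether this actually lowers register usage for a given kernel is not guaranteed and should be checked with `-Xptxas -v`.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Evaluates sin(a + atan2(b, c)) without calling atan2, using
//   sin(a + t) = sin(a)*cos(t) + cos(a)*sin(t),
//   cos(t) = c / r,  sin(t) = b / r,  r = sqrt(b*b + c*c).
// Caveats: yields NaN when b == 0 and c == 0 (atan2f would give 0 there),
// and b*b + c*c can overflow or underflow for extreme inputs.
__host__ __device__ inline float sinPlusAtan2(float a, float b, float c)
{
    float inv_r = 1.0f / sqrtf(b * b + c * c);
    return (sinf(a) * c + cosf(a) * b) * inv_r;
}

// Same structure as the snippet in the question, with the helper swapped in.
__global__ void transform(const float *a, const float *b, const float *c,
                          const float *d, const float *e, const float *f,
                          float *z, float *w, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float x = sinPlusAtan2(a[idx], b[idx], c[idx]);
    float y = sinPlusAtan2(d[idx], e[idx], f[idx]);

    z[idx] = x + y;
    w[idx] = y - x;
}
```

The helper is marked `__host__ __device__` so it can be compared against `sinf(a + atan2f(b, c))` on the CPU before being used in a kernel.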

Thank you. Both messages have been very helpful!