Register usage spike in SASS with divison slow/full path

Hello,

My colleagues and I have noticed that division seems to cause a jump in the register usage. Our kernels fuse many operations into one large complex kernel, and thus have limited registers available. Removing this jump in registers would help us reduce register pressure and improve performance.

I have created a basic example that reproduces this issue on a much simpler kernel. Attached along with the example is a script that compiles and runs the executable to produce the data in this post (run on a GH200) and an example SASS file.
div_reg_pressure.cu (7.5 KB)
div_reg_pressure_reg_128.sass (2.8 MB)
run.sh (1.0 KB)

As expected, divisions in the code get translated to blocks of SASS instructions like this

        /*4020*/       MUFU.RCP R5, R4 ;                                                        
        /*4030*/       BSSY B2, `(.L_x_69) ;                                                    
        /*4040*/       FCHK P0, R70, R4 ;                                                       
        /*4050*/       FFMA R0, -R4, R5, 1 ;                                                    
        /*4060*/       FFMA R6, R5, R0, R5 ;                                                    
        /*4070*/       FFMA R5, R70, R6, RZ ;                                                   
        /*4080*/       FFMA R0, -R4, R5, R70 ;                                                  
        /*4090*/       FFMA R5, R6, R0, R5 ;                                                    
        /*40a0*/  @!P0 BRA `(.L_x_70) ;                                                         
        /*40b0*/       MOV R0, R70 ;                                                            
        /*40c0*/       MOV R20, R4 ;                                                            
        /*40d0*/       MOV R21, 0x40f0 ;                                                        
        /*40e0*/       CALL.REL.NOINC `($__internal_0_$__cuda_sm3x_div_rn_noftz_f32_slowpath) ; 
        /*40f0*/       MOV R5, R0 ;                                                             
.L_x_70:                                                                                        
        /*4100*/       BSYNC B2 ;                                                               
.L_x_69:              

When I look at the live register usage with nvdisasm -plr , I see something like this.

        /*4020*/       MUFU.RCP R5, R4 ;                                                         // |  68
        /*4030*/       BSSY B2, `(.L_x_69) ;                                                     // |  68
        /*4040*/       FCHK P0, R70, R4 ;                                                        // |  68
        /*4050*/       FFMA R0, -R4, R5, 1 ;                                                     // |  69
        /*4060*/       FFMA R6, R5, R0, R5 ;                                                     // |  70
        /*4070*/       FFMA R5, R70, R6, RZ ;                                                    // |  69
        /*4080*/       FFMA R0, -R4, R5, R70 ;                                                   // |  70
        /*4090*/       FFMA R5, R6, R0, R5 ;                                                     // |  70
        /*40a0*/  @!P0 BRA `(.L_x_70) ;                                                          // |  68
        /*40b0*/       MOV R0, R70 ;                                                             // |  43
        /*40c0*/       MOV R20, R4 ;                                                             // |  43
        /*40d0*/       MOV R21, 0x40f0 ;                                                         // |  42
        /*40e0*/       CALL.REL.NOINC `($__internal_0_$__cuda_sm3x_div_rn_noftz_f32_slowpath) ;  // | 102
        /*40f0*/       MOV R5, R0 ;                                                              // |  68
.L_x_70:                                                                                         // |  67
        /*4100*/       BSYNC B2 ;                                                                // |  67
.L_x_69:              

There is notable jump in registers at CALL.REL.NOINC ($__internal_0_$__cuda_sm3x_div_rn_noftz_f32_slowpath).
In the double precision kernel (not shown) there is a corresponding jump at CALL.REL.NOINC ($__internal_1_$__cuda_sm20_div_rn_f64_full). I’m guessing it has something to do with the function call not being inlined. However, the live register view seems to indicate that previously unused registers are used. Additionally, the size the register jump seems to depend on the maximum number of registers used in the kernel. Looking at the SASS produced by the example problem while varying register limit produces the following jumps.

Registers Used Float Jump Double Jump
32 12 12
64 28 28
128 60 60
256 124 125

The jump in registers causes a noticeable difference in performance when compared to a custom division function. In the attached example, to create a custom division I used the PTX instructions for a quick division approximation followed by 4 Newton-Raphson iteration for refinement. When comparing the times using the two methods on the sample kernel with different max-register limits you can see that the custom division is faster, which is likely due to the difference in spilling.

maxrregcount Elements per Thread Type Intrinsic Div Time (s) Custom Div Time (s) spilling intrinsic div spilling custom div
32 3 Double 3.389280 1.600416 200 0
32 6 Float 1.348224 0.988000 72 0
64 6 Double 3.206976 1.850176 220 0
64 12 Float 1.650112 1.259200 96 0
128 12 Double 2.993760 2.011520 276 0
128 24 Float 1.894432 1.797600 120 72
256 24 Double 3.021728 2.468320 412 80
256 48 Float 2.667872 2.386368 188 152

My questions are

  1. What causes the jump in register usage?
  2. Why does the register jump size depend on the total number of registers?
  3. What can be done to avoid the jump in registers?

Thank you,
Josh

njuffa has commented on a number of forum posts related to division routines. While you are waiting to see if there are any more direct responses, a google search like this may possibly give you some interesting things to read.

No division operations of any kind are directly supported by GPU hardware. All of them are emulated. This takes either the form of a called subroutine, or (more frequently with modern toolchains) an inlined “fast path” handling common cases, with a called subroutine “slow path” that handles uncommon and exceptional cases. This second arrangement is what you can observe in the SASS code shown.

The complexity of the slow path often exceeds the complexity of the fast path by a considerable margin, and more registers are needed to hold temporary results. This is unavoidable, and this code has already been tuned to require as few registers as possible.

If you need to reduce register pressure, try:

(1) eliminate division operations by algebraic transformations

(2) floating-point: replace division with multiplication by the reciprocal. Instead of x / y use (1.0 / y) * x. In the specific case of transforming x/y+z into fma(1.0/y, x, z) the impact on accuracy is typically negligible thanks to the FMA, otherwise there will be minor negative impact on accuracy. The reason this can reduce register pressure somewhat is because reciprocal is emulated by a separate template which is of lower complexity than that used for division.

(3) floating-point, single-precision only: use the __fdividef() device function intrinsic to switch to an approximate division. This might not be suitable in all contexts, so you will have to evaluate its utility carefully. If you want a bigger hammer affecting all single-precision divisions, you could look at the compiler switch --prev-div=false. Another relevant compiler switch is--ftz=true, which is a mallet-sized hammer, so use it with caution. The compiler switch --use_fast_math comprises these two plus additional ones. See the documentation for these switches before deploying them.

Without a careful analysis of the entire code, I cannot explain that. From examining other instances in the past, I am reasonably certain that this is not actionable by itself but simply an artifact: a side effect caused by normal code transformations applied by the compiler in conjunction with the basic nature of the division emulation as described above.

Do you need the slow path for very small or large dividends, divisors or results at all? Regardless of whether it is inlined or not.

It is not clear to me why a custom division would require 4 Newton-Raphson iterations, unless the starting approximation is very inaccurate. The GPU’s hardware reciprocal instruction MUFU.RCP already delivers an approximation that is almost accurate to full single precision, and that is what __fdividef() uses internally, so there should be no need to role one’s own approximate single-precision division.

For a limited-range, approximate, double-precision division one could use the MUFU.RCP64H instruction (accessible via rcp.approx.ftz.f64 in PTX), which delivers results accurate to about 20 bits, and perform two Newton-Raphson iterations. A better choice would probably be to use a single Halley iteration with cubic convergence.

A very minimalist implementation might look like so:

__device__ double raw_recip64 (double a)
{
    asm ("rcp.approx.ftz.f64 %0,%0;\n\t" : "+d"(a));
    return a;
}

__device__ double fp64_quick_division (double x, double y)
{
    double e, r;
    r = raw_recip64 (y);
    // refine reciprocal with Halley iteration
    e = fma (r, -y, 1.0);
    e = fma (e, e, e);
    r = fma (e, r, r);
    return r * x;
}

Thank you for your replies!

Thank you for your suggestions, we have tried techniques like these to some degree of success, but are still left with places where we want to use divisions or reciprocals to full precision.

I understand that the slowpath should be there and is complex. I would expect the additional register cost to be constant, but this is not what I am seeing. Below is an excerpt of the SASS with all the live register data (scroll to the right). You can see that it is assigning to many registers, including 4 predicate registers. The use of so many registers seems very suspect. Now it may be that I am not correctly interpreting the live register data that I am seeing from nvdisasm or nsight compute. Yet, I see that the compiler is making room as if the jumps were real by spilling before the division and loading after

        /*4020*/       MUFU.RCP R5, R4 ;                                                         // |  68   : : : v ^                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  :                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          |            | 2         : :  |
        /*4030*/       BSSY B2, `(.L_x_69) ;                                                     // |  68   : : : : :                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  :                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          |            | 2         : :  |
        /*4040*/       FCHK P0, R70, R4 ;                                                        // |  68   : : : v :                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  v                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          | 1 ^        | 2         : :  |
        /*4050*/       FFMA R0, -R4, R5, 1 ;                                                     // |  69 ^ : : : v v                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  :                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          | 1 :        | 2         : :  |
        /*4060*/       FFMA R6, R5, R0, R5 ;                                                     // |  70 v : : : : v ^                                            :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  :                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          | 1 :        | 2         : :  |
        /*4070*/       FFMA R5, R70, R6, RZ ;                                                    // |  69   : : : : ^ v                                            :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  v                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          | 1 :        | 2         : :  |
        /*4080*/       FFMA R0, -R4, R5, R70 ;                                                   // |  70 ^ : : : v v :                                            :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  v                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          | 1 :        | 2         : :  |
        /*4090*/       FFMA R5, R6, R0, R5 ;                                                     // |  70 v : : : : x v                                            :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  :                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          | 1 :        | 2         : :  |
        /*40a0*/  @!P0 BRA `(.L_x_70) ;                                                          // |  68   : : : : :                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :  :                                               :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          | 1 v        | 2         : :  |
        /*40b0*/       MOV R0, R70 ;                                                             // |  43 ^ : :   :                                                :  :  :  :  :  :  :  :  :  :              :  :  :  :              :  :  :  :              :  :  :  :              :  :  :  :              :  :  v                                               :  :              :  :     :                                               :           :                       :       :                       :          |            |                |
        /*40c0*/       MOV R20, R4 ;                                                             // |  43 : : :   v                                          ^     :  :  :  :  :  :  :  :  :  :              :  :  :  :              :  :  :  :              :  :  :  :              :  :  :  :              :  :                                                  :  :              :  :     :                                               :           :                       :       :                       :          |            |                |
        /*40d0*/       MOV R21, 0x40f0 ;                                                         // |  42 : : :                                                 ^  :  :  :  :  :  :  :  :  :  :              :  :  :  :              :  :  :  :              :  :  :  :              :  :  :  :              :  :                                                  :  :              :  :     :                                               :           :                       :       :                       :          |            |                |
        /*40e0*/       CALL.REL.NOINC `($__internal_0_$__cuda_sm3x_div_rn_noftz_f32_slowpath) ;  // | 102 x v : ^ ^ ^ ^ ^ ^ ^  ^  ^  ^  ^  ^  ^                    :  :  :  :  :  :  :  :  :  :  ^  ^  ^  ^  :  :  :  :  ^  ^  ^  ^  :  :  :  :  ^  ^  ^  ^  :  :  :  :  ^  ^  ^  ^  :  :  :  :  ^  ^  ^  ^  :  :        ^  ^  ^  ^              ^  ^  ^  ^        :  :  ^  ^  ^  ^  :  :     :  ^  ^  ^  ^                   ^   ^   ^   ^   :           :   ^   ^   ^   ^       :       :   ^   ^   ^   ^       :          | 4 ^ ^ ^ ^  | 2         ^ ^  |
        /*40f0*/       MOV R5, R0 ;                                                              // |  68 v : : : : ^                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :                                                  :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          |            | 2         : :  |
.L_x_70:                                                                                         // |  67   : : : : :                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :                                                  :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          |            | 2         : :  |
        /*4100*/       BSYNC B2 ;                                                                // |  67   : : : : :                                              :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :     :     :  :  :                                                  :  :              :  :     :  :                                        :   :           :       :   :           :       :       :   :           :          |            | 2         : :  |
.L_x_69:        

// Legend:
//     ^       : Register assignment
//     v       : Register usage
//     x       : Register usage and reassignment
//     :       : Register in use
//     <space> : Register not in use
//     #       : Number of occupied registers         

If you would like to look at the code you can find it linked in the original post.
The excerpt of just the kernel is below.

// Kernel assumes each thread computes exactly M dot products
template <class T, unsigned int m, bool custom_div>
__global__ void dotProduct3D(const T* volatile A, const T* volatile B, T* volatile C) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int base = tid * m;
    
    // Load A vector
    T A1[m];
    T A2[m];
    T A3[m];

    // Load and normalize B vector
    T B1[m];
    T B2[m];
    T B3[m];

    
    // load everything in
    // this simulate the register pressure of more complex kernels
    #pragma unroll
    for (int i = 0; i < m; ++i) {
        int idx = base + i;
        int i0 = 3 * idx;
        
        A1[i] = A[i0]; 
        A2[i] = A[i0 + 1]; 
        A3[i] = A[i0 + 2]; 

        B1[i] = B[i0]; 
        B2[i] = B[i0 + 1]; 
        B3[i] = B[i0 + 2]; 
        
    }
    
    // do some work that uses division
    #pragma unroll
    for (int i = 0; i < m; ++i) {
        int idx = base + i;
        
        // normalize the B vector
        if (custom_div){
            T mag2 = B1[i] * B1[i] + B2[i] * B2[i] + B3[i] * B3[i];
            mag2 = fastish_inverse(mag2);
            B1[i] *= mag2;
            B2[i] *= mag2;
            B3[i] *= mag2;
            
        } else {
            T mag2 = B1[i] * B1[i] + B2[i] * B2[i] + B3[i] * B3[i];
            B1[i] /= mag2;
            B2[i] /= mag2;
            B3[i] /= mag2;
        }
        
        // Compute dot product
        C[idx] = A1[i] * B1[i] + A2[i] * B2[i] + A3[i] * B3[i];
    }
}

I am not sure I trust the register pressure information from nvdisasm, but that is just me.

If, based on your analysis, you believe that either (1) nvdisasm is making misleading reports of register pressure or (2) that the CUDA toolchain is erroneously using too many registers in code involving floating-point divisions, you would want to file a bug with NVIDIA.

In practical terms, you have two choices: (1) Use the fully IEEE-754 compliant division emulation provided by CUDA and pay the price in register pressure and dynamic instruction count. (2) Use a reduced range, reduced accuracy division implementation (either your own or one offered by CUDA) to achieve lower register pressure.

There is no magic that provides a fully IEEE-754 compliant division with substantially lower register pressure. I was involved with the tuning of division emulation routines when I worked for NVIDIA over a decade ago, and they were revised multiple time back then. From looking at the generated SASS I am also quite sure that they have been revised at least twice since then. It is therefore unlikely that any substantially better solution can be found, but of course that is not necessarily entirely impossible. A person may come along that spots efficiency improvements previous engineers overlooked.

The basic trade-off here is supporting divisions in hardware, with an accompanying reduction in throughput for simple arithmetic, or punting division to software, with a price paid through increased dynamic instruction count and higher register pressure. NVIDIA’s conclusion apparently was that the latter solution delivers the most bang/buck overall. This also means some apps (like yours) are going to be exposed to the negative effects of this trade-off.

Guilty as charged here. The custom division is not as performant as possible and adapted from a CUDA Fortran routine that didn’t have as good of an initial guess (I don’t know how to elicit the MUFU instructions via Fortran). I appreciate your suggestions about how to make it better.

You might want to try fp64_quick_division() in my post above. That is the most minimalist implementation of an approximate FP64 division I can think off.

I have shared this with a devtec at NVIDIA who is going to ask around internally to see if this is a case of (1), (2), or something else. I will update this post when I know more.

I think the issue was that the added register pressure was not a constant number of registers.
Even if some loops are unrolled, the slow paths of different divisions should not be mixed and count several times?

Wrong reporting of the register pressure by nvdisasm could of course be a reason.

Did you had any return on whether it is (1) or (2) by any chance?

I believe I am seeing a similar problem in a code I am working on.

Hi @mikolaj.kowalski, I have no update and don’t expect I will every get an answer. We have simply been using the custom division as a work around. This of course avoids the slow path entirely and is risky.

If you filed a bug with NVIDIA, you should definitely get some kind of an answer.