Hello,
My colleagues and I have noticed that division seems to cause a jump in the register usage. Our kernels fuse many operations into one large complex kernel, and thus have limited registers available. Removing this jump in registers would help us reduce register pressure and improve performance.
I have created a basic example that reproduces this issue on a much simpler kernel. Attached along with the example is a script that compiles and runs the executable to produce the data in this post (run on a GH200) and an example SASS file.
div_reg_pressure.cu (7.5 KB)
div_reg_pressure_reg_128.sass (2.8 MB)
run.sh (1.0 KB)
As expected, divisions in the code get translated to blocks of SASS instructions like this:
/*4020*/ MUFU.RCP R5, R4 ;
/*4030*/ BSSY B2, `(.L_x_69) ;
/*4040*/ FCHK P0, R70, R4 ;
/*4050*/ FFMA R0, -R4, R5, 1 ;
/*4060*/ FFMA R6, R5, R0, R5 ;
/*4070*/ FFMA R5, R70, R6, RZ ;
/*4080*/ FFMA R0, -R4, R5, R70 ;
/*4090*/ FFMA R5, R6, R0, R5 ;
/*40a0*/ @!P0 BRA `(.L_x_70) ;
/*40b0*/ MOV R0, R70 ;
/*40c0*/ MOV R20, R4 ;
/*40d0*/ MOV R21, 0x40f0 ;
/*40e0*/ CALL.REL.NOINC `($__internal_0_$__cuda_sm3x_div_rn_noftz_f32_slowpath) ;
/*40f0*/ MOV R5, R0 ;
.L_x_70:
/*4100*/ BSYNC B2 ;
.L_x_69:
When I look at the live register usage with nvdisasm -plr, I see something like this:
/*4020*/ MUFU.RCP R5, R4 ; // | 68
/*4030*/ BSSY B2, `(.L_x_69) ; // | 68
/*4040*/ FCHK P0, R70, R4 ; // | 68
/*4050*/ FFMA R0, -R4, R5, 1 ; // | 69
/*4060*/ FFMA R6, R5, R0, R5 ; // | 70
/*4070*/ FFMA R5, R70, R6, RZ ; // | 69
/*4080*/ FFMA R0, -R4, R5, R70 ; // | 70
/*4090*/ FFMA R5, R6, R0, R5 ; // | 70
/*40a0*/ @!P0 BRA `(.L_x_70) ; // | 68
/*40b0*/ MOV R0, R70 ; // | 43
/*40c0*/ MOV R20, R4 ; // | 43
/*40d0*/ MOV R21, 0x40f0 ; // | 42
/*40e0*/ CALL.REL.NOINC `($__internal_0_$__cuda_sm3x_div_rn_noftz_f32_slowpath) ; // | 102
/*40f0*/ MOV R5, R0 ; // | 68
.L_x_70: // | 67
/*4100*/ BSYNC B2 ; // | 67
.L_x_69:
There is a notable jump in live registers at CALL.REL.NOINC ($__internal_0_$__cuda_sm3x_div_rn_noftz_f32_slowpath), from 42 to 102 in the listing above.
In the double precision kernel (not shown) there is a corresponding jump at CALL.REL.NOINC ($__internal_1_$__cuda_sm20_div_rn_f64_full). I’m guessing it has something to do with the function call not being inlined. However, the live register view seems to indicate that previously unused registers become live at the call. Additionally, the size of the register jump seems to depend on the maximum number of registers used in the kernel. Looking at the SASS produced by the example problem while varying the register limit produces the following jumps. In each row, the jump is close to half the register count minus 4.
| Registers Used | Float Jump | Double Jump |
|---|---|---|
| 32 | 12 | 12 |
| 64 | 28 | 28 |
| 128 | 60 | 60 |
| 256 | 124 | 125 |
The jump in registers causes a noticeable difference in performance when compared to a custom division function. In the attached example, I built the custom division from the PTX instructions for a quick division approximation, followed by 4 Newton-Raphson iterations for refinement. The table below compares the two methods on the sample kernel under different max-register limits. The custom division is faster in every case, which is likely due to the difference in spilling.
| maxrregcount | Elements per Thread | Type | Intrinsic Div Time (s) | Custom Div Time (s) | Intrinsic Div Spilling | Custom Div Spilling |
|---|---|---|---|---|---|---|
| 32 | 3 | Double | 3.389280 | 1.600416 | 200 | 0 |
| 32 | 6 | Float | 1.348224 | 0.988000 | 72 | 0 |
| 64 | 6 | Double | 3.206976 | 1.850176 | 220 | 0 |
| 64 | 12 | Float | 1.650112 | 1.259200 | 96 | 0 |
| 128 | 12 | Double | 2.993760 | 2.011520 | 276 | 0 |
| 128 | 24 | Float | 1.894432 | 1.797600 | 120 | 72 |
| 256 | 24 | Double | 3.021728 | 2.468320 | 412 | 80 |
| 256 | 48 | Float | 2.667872 | 2.386368 | 188 | 152 |
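For readers who do not want to open the attachment, the custom division idea (an approximate reciprocal refined by Newton-Raphson) can be sketched roughly as follows. This is an illustrative sketch, not the code in div_reg_pressure.cu: the names `rcp_seed` and `div_newton` are hypothetical, the device path assumes the PTX instruction `rcp.approx.ftz.f32` as the seed, and the host path uses a crude bit-trick seed only so that the function can also be compiled and checked as plain C++.

```cpp
#include <math.h>
#include <stdint.h>
#include <string.h>

// Let the same source compile with nvcc or with a plain C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: rough reciprocal used as the Newton-Raphson seed.
__host__ __device__ inline float rcp_seed(float b) {
#ifdef __CUDA_ARCH__
    // Device path (assumption): PTX approximate reciprocal.
    float r;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(r) : "f"(b));
    return r;
#else
    // Host stand-in: exponent-flipping bit trick, only a rough estimate
    // and only meaningful for normal values of moderate magnitude.
    float mag = fabsf(b);
    uint32_t bits;
    memcpy(&bits, &mag, sizeof bits);
    bits = 0x7EF311C7u - bits;
    float r;
    memcpy(&r, &bits, sizeof r);
    return copysignf(r, b);
#endif
}

// Hypothetical helper: a / b via Newton-Raphson on the reciprocal.
// No call to a slow-path routine, so no handling of 0, inf, NaN or subnormals.
__host__ __device__ inline float div_newton(float a, float b, int iters = 4) {
    float r = rcp_seed(b);
    for (int i = 0; i < iters; ++i) {
        float e = fmaf(-b, r, 1.0f);  // error term: 1 - b*r
        r = fmaf(r, e, r);            // refine:     r + r*e
    }
    float q = a * r;                  // first quotient estimate
    float rem = fmaf(-b, q, a);       // residual:   a - b*q
    return fmaf(rem, r, q);           // one correction step on the quotient
}
```

A call such as `div_newton(x, y)` would replace `x / y` inside the kernel. The trade-off is that this sketch skips the special-case handling that the built-in division presumably delegates to its slow-path call, so it is only suitable when the operands are known to be well-behaved.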
My questions are:
- What causes the jump in register usage?
- Why does the register jump size depend on the total number of registers?
- What can be done to avoid the jump in registers?
Thank you,
Josh