Division Slow Path

njuffa · January 7, 2022, 8:43pm

The symbol name suggests that this is code for compute capability 3.x. Is your target GPU actually a GPU with compute capability 3.x? If not, you would want to specify the correct target architecture on the nvcc command line (for example, -arch=sm_70 for a compute capability 7.0 device).

I compiled a sample single-precision division for sm_30. I have not fully annotated the disassembly, as that is quite tedious, but it seems the division slowpath is taken for overflow and underflow cases. Given your range of divisors, it seems likely that you are hitting the underflow case, i.e. many of your dividends are already very small and produce subnormal quotients.
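To check whether a given data set is likely to hit the underflow case, one option is to classify the quotients on the host first. The following is a minimal sketch in plain C++, assuming the host does not flush subnormals to zero (for example, it is not built with -ffast-math). The helper name quotient_is_subnormal is hypothetical and not part of any CUDA API.

```cpp
#include <cmath>

// Hypothetical helper, not part of any CUDA API: reports whether the
// single-precision quotient x / y lands in the subnormal range, i.e. is
// nonzero but smaller in magnitude than FLT_MIN (about 1.18e-38).
// Assumes the host does not flush subnormal results to zero.
bool quotient_is_subnormal(float x, float y) {
    float q = x / y;  // rounded to single precision
    return std::fpclassify(q) == FP_SUBNORMAL;
}
```

For instance, quotient_is_subnormal(1.0e-30f, 1.0e10f) is true because the quotient is about 1e-40, while quotient_is_subnormal(1.0e-30f, 1.0e20f) is false because that quotient underflows all the way to zero.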

Even without hitting the slowpath, an IEEE-754 compliant floating-point division is not going to be as fast as the approximate division __fdividef(). The single-precision division fastpath for sm_30 is a called subroutine of 17 instructions, while even with -ftz=false, __fdividef() results in 5 inlined instructions.
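If the reduced accuracy is acceptable, switching to the approximate division can be sketched roughly as follows. The names HOST_DEVICE, div_maybe_approx and divide_kernel are hypothetical, chosen only for illustration. The wrapper uses __fdividef() in device code and falls back to the ordinary / operator on the host, so the same file also compiles with a plain C++ compiler. Note that, per the CUDA documentation, __fdividef() is less accurate than / and returns zero for divisors with magnitude between 2^126 and 2^128, so this is only a sketch for cases where those limits do not matter.

```cpp
// HOST_DEVICE is a hypothetical convenience macro: it expands to the CUDA
// function qualifiers under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical wrapper: approximate division in device code,
// ordinary IEEE-754 division in host code.
HOST_DEVICE inline float div_maybe_approx(float x, float y) {
#ifdef __CUDA_ARCH__
    return __fdividef(x, y);  // device pass: fast, approximate intrinsic
#else
    return x / y;             // host pass: correctly rounded division
#endif
}

#ifdef __CUDACC__
// Hypothetical kernel: element-wise q[i] = x[i] / y[i] using the wrapper.
__global__ void divide_kernel(const float *x, const float *y, float *q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        q[i] = div_maybe_approx(x[i], y[i]);
    }
}
#endif
```

Keeping the choice inside one wrapper makes it easy to compare results and timing of the two variants by changing a single line.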

I think I have noted before that for optimal performance NVIDIA should look into inlining the single-precision division fastpath, leaving just the slowpath as a called subroutine. Ah, here:
