The symbol name suggests that this is code for compute capability 3.x. Is your target GPU actually a GPU with compute capability 3.x? If not, you would want to specify the correct target architecture on the nvcc command line (the -arch switch).
I compiled a sample single-precision division for sm_30, and it seems (I have not fully annotated the disassembly, as it is quite tedious) the division slowpath is taken for overflow and underflow cases. Given your range of divisors, it seems likely that you are hitting the underflow case, i.e. many of your dividends are already very small and produce subnormal quotients.
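To check whether the underflow case applies to your data, a rough host-side sketch along the following lines could count how many quotients land in the subnormal range. The helper name count_subnormal_quotients and the sample values are made up for illustration; substitute your actual dividends and divisors. It assumes the host compiler performs the division in single precision without flush-to-zero, which is the usual default on x86-64.

```cuda
#include <cmath>
#include <cstdio>

// Hypothetical helper (not part of any library): counts how many
// single-precision quotients a[i] / b[i] are subnormal.
static int count_subnormal_quotients(const float *a, const float *b, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        float q = a[i] / b[i];
        if (std::fpclassify(q) == FP_SUBNORMAL) {
            count++;
        }
    }
    return count;
}

int main(void)
{
    // Illustrative inputs only: the second pair has a tiny dividend and a
    // large divisor, so its quotient falls below the smallest normal float.
    const float a[] = { 1.0f, 1e-30f, 0.0f };
    const float b[] = { 3.0f, 1e10f,  1.0f };
    int n = 3;

    printf("subnormal quotients: %d of %d\n",
           count_subnormal_quotients(a, b, n), n);
    return 0;
}
```

If a large fraction of your quotients is counted here, that would be consistent with the slowpath being taken frequently.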
Even without hitting the slowpath, an IEEE-754 compliant floating-point division is not going to be as fast as the approximate division __fdividef(). The single-precision division fastpath for sm_30 is a called subroutine of 17 instructions, while even with -ftz=false, __fdividef() results in 5 inlined instructions.
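If you want to compare the two variants yourself, a minimal sketch might look like the following. The kernel names, the host wrapper run_both, and the test values are my own placeholders, and error checking is kept to a bare minimum. Build it for the architecture of your GPU and look at the machine code with cuobjdump -sass to see the called subroutine versus the inlined sequence.

```cuda
// Build, for example: nvcc -arch=sm_30 -o divtest divtest.cu
// (sm_30 is just the architecture discussed here; use the one matching your GPU.)
// Note: -use_fast_math implies -prec-div=false, which would turn the plain
// '/' below into an approximate division as well, so leave it off for this test.
#include <cstdio>
#include <cuda_runtime.h>

// IEEE-754 compliant division (nvcc default, -prec-div=true)
__global__ void div_ieee(const float *a, const float *b, float *q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] = a[i] / b[i];
}

// Approximate division via the __fdividef() intrinsic
__global__ void div_approx(const float *a, const float *b, float *q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] = __fdividef(a[i], b[i]);
}

// Hypothetical host wrapper: runs both kernels on the same inputs.
static cudaError_t run_both(const float *ha, const float *hb,
                            float *hq_ieee, float *hq_approx, int n)
{
    float *da = 0, *db = 0, *dq = 0;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dq, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;

    div_ieee<<<blocks, threads>>>(da, db, dq, n);
    cudaMemcpy(hq_ieee, dq, bytes, cudaMemcpyDeviceToHost);

    div_approx<<<blocks, threads>>>(da, db, dq, n);
    cudaMemcpy(hq_approx, dq, bytes, cudaMemcpyDeviceToHost);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dq);
    return cudaGetLastError();
}

int main(void)
{
    // Illustrative values; the last pair produces a subnormal quotient.
    const float a[] = { 6.0f, 1.0f, 1e-30f };
    const float b[] = { 3.0f, 3.0f, 1e10f  };
    float q_ieee[3], q_approx[3];

    cudaError_t err = run_both(a, b, q_ieee, q_approx, 3);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < 3; i++) {
        printf("%g / %g: ieee=%.9g approx=%.9g\n",
               a[i], b[i], q_ieee[i], q_approx[i]);
    }
    return 0;
}
```

Timing the two kernels on your real data (rather than these three values) would show how much of the difference comes from the fastpath call overhead and how much from the slowpath.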
I think I have noted before that for optimal performance NVIDIA should look into inlining the single-precision division fastpath, leaving just the slowpath as a called subroutine. Ah, here: