I want to get an idea of how much slower division of some general types like double or float is compared to +, -, *. I’ve been able to find this https://www.agner.org/optimize/instruction_tables.pdf for the host, I’m looking for something similar for CUDA hardware. Now, I realize this would be dependent on the generation (Kepler will be different from Lovelace) and the type (double, float, int, 32bit compared to 64bit, …), but I think it should be possible to give some guesstimated comparison. It seems on the host division is actually not much slower than multiplication. Is the same true for CUDA? Not looking for exact numbers, but something like 1x, 10x, … compared to multiplication would be fine.

On a (Volta) V100, I would estimate the throughput ratio (multiplication:division) for `float`

at 1980159/157758 = ~12.55 and for `double`

at 2403962/310198 = ~7.75

```
$ cat t1837.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
const int S = 32768;
template <typename T>
__global__ void k(T * __restrict__ d, const int N, const T denom){
for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < N; i += gridDim.x*blockDim.x){
T my_d = d[i];
for (int j = 0; j < S; j++)
#ifdef USE_DIV
my_d /= denom;
#else
my_d *= denom;
#endif
d[i] = my_d;}
}
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
#ifndef USE_DOUBLE
typedef float mt;
#else
typedef double mt;
#endif
const int my_N = 1048576*32;
int main(){
mt *d, *h_d;
cudaMalloc(&d, my_N*sizeof(d[0]));
h_d = new mt[my_N];
for (int i = 0; i < my_N; i++) h_d[i] = 1.01;
mt num = 1.001;
k<<<160,1024>>>(d, my_N, num); // warm-up
cudaMemcpy(d, h_d, my_N*sizeof(d[0]), cudaMemcpyHostToDevice);
unsigned long long dt = dtime_usec(0);
k<<<160,1024>>>(d, my_N, num);
cudaDeviceSynchronize();
dt = dtime_usec(dt);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl; return 0;}
std::cout << "elapsed time: " << dt << "us" << std::endl;
return 0;
}
$ nvcc -arch=sm_70 t1837.cu -o t1837
$ ./t1837
elapsed time: 157758us
$ nvcc -arch=sm_70 t1837.cu -o t1837 -DUSE_DIV
$ ./t1837
elapsed time: 1980159us
$ nvcc -arch=sm_70 t1837.cu -o t1837 -DUSE_DOUBLE
$ ./t1837
elapsed time: 310198us
$ nvcc -arch=sm_70 t1837.cu -o t1837 -DUSE_DOUBLE -DUSE_DIV
$ ./t1837
elapsed time: 2403962us
$
```

On a (Kepler) K20Xm, I would estimate the throughput ratio (multiplication:division) for `float`

at 15248483/ 881876 = ~17.29 and for `double`

at 17869279/1772434 = ~10.08

```
$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets
$ CUDA_VISIBLE_DEVICES="1" ./t1837
elapsed time: 881876us
$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DIV
$ CUDA_VISIBLE_DEVICES="1" ./t1837
elapsed time: 15248483us
$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DOUBLE
$ CUDA_VISIBLE_DEVICES="1" ./t1837
elapsed time: 1772434us
$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DOUBLE -DUSE_DIV
$ CUDA_VISIBLE_DEVICES="1" ./t1837
elapsed time: 17869279us
```

I am puzzled. It does not make sense that the ratios would be smaller for `double`

and larger for `float`

. For both types division is implemented via an iterative process and starts with a HW approximation to the reciprocal of roughly the same accuracy (for technical reasons, slightly less accurate in the `double`

case, actually). It therefore stands to reason that the `double`

computation should require one additional iteration, and its cost measured in multiples of the cost of a multiply should therefore be *higher*.

I wonder whether the `float`

computation hits the slow path of the computation due to overflow or underflow? An alternative hypothesis would be that the `float`

division code is less optimized. Or maybe some third explanation that I am currently failing to take into account because it has been many years since I last looked at the emulation sequences in detail.

[Later:]

I built my own test framework independently, and the resulting data matches up quite well with Robert Crovella’s experiments. On my Quadro RTX 4000 (Turing), I find the following throughput:

single precision, throughput ratio of 15.7x between multiplies and divides:

4021.3 x 10^{9} FMULs per second

273.6 x 10^{9} FDIVs per second

double precision, throughput ratio of 8.8x between multiplies and divides:

112.6 x 10^{9} DMULs per second

12.8x 10^{9} DDIVs per second

The multiplication throughput matches what one would expect based on published specifications (within noise caused by dynamic clocking), giving me some confidence that the framework works correctly.

On further reflection, the initially surprising lower efficiency of the single-precision division relative to multiplies may be caused by function call overhead and lower SFU throughput (for the starting approximation of the reciprocal) relative to FP32 operations.

[Even later]:

Call overhead indeed seems to be the reason for the comparatively slow single-precision division. I performed an experiment where I replaced CUDA’s built-in division with my own function `fp32_div()`

which *inlines* the fastpath of the division algorithm and *calls* the slowpath when necessary. With this change I get a 461.1 x 10^{9} single-precision divisions per second on my Quadro RTX 4000, a **68% increase in throughput**.

I am showing my code below. While I am reasonably confident that this delivers correctly rounded results I have **not** rigorously shown that. Note that I cannot make this code fully efficient as there does not seem to be a way to get at the `FCHK`

instruction from CUDA. The fastpath of the single-precision division is small enough that it should be inlined, something NVIDIA might want to consider. Current x86 CPUs are certainly no slouch when it comes to single-precision division throughput. At a throughput of one `VDIVPS`

every 5 cycles per Agner Fog’s table, the HEDT Xeon W-3175X can crunch through 139 x 10^{9} per second if I didn’t make any mistakes in my arithmetic.

```
__noinline__ __device__ float fp32_div_slowpath (float dividend, float divisor)
{
return dividend / divisor;
}
/* Subnormal arguments and results are treated as zero */
__forceinline__ __device__ float rcp_approx_gpu (float divisor)
{
float r;
asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
return r;
}
__forceinline__ __device__ float fp32_div (float dividend, float divisor)
{
const uint32_t FP32_MANT_MASK = 0x007fffffu;
const uint32_t FP32_ONE = 0x3f800000u;
const uint32_t FP32_SIGN_MASK = 0x80000000u;
const uint32_t FP32_SIGN_EXPO_MASK = 0xff800000u;
const uint32_t FP32_QUOTIENT_HI_LIMIT = 0x7f7fffffu; // 0x1.0p+128 - ulp
const uint32_t FP32_QUOTIENT_LO_LIMIT = 0x00800001u; // 0x1.0p-126 + ulp
// fast path
float recip_approx = rcp_approx_gpu (divisor);
float recip_err = fmaf (recip_approx, -divisor, 1.0f);
float refined_recip = fmaf (recip_approx, recip_err, recip_approx);
float dividend_mant = __int_as_float ((__float_as_int (dividend) & FP32_MANT_MASK) | FP32_ONE);
float dividend_scale = __int_as_float (__float_as_int (dividend) & FP32_SIGN_EXPO_MASK);
float refined_quotient_mant = refined_recip * dividend_mant;
float refined_quotient_residual = fmaf (refined_quotient_mant, -divisor, dividend_mant);
float final_quotient_mant = fmaf (refined_recip, refined_quotient_residual, refined_quotient_mant);
float final_quotient = final_quotient_mant * dividend_scale;
// check if we need to apply the slow path and invoke it if that is the case
uint32_t iq = __float_as_int (final_quotient) & ~FP32_SIGN_MASK;
if ((iq - FP32_QUOTIENT_LO_LIMIT) > (FP32_QUOTIENT_HI_LIMIT - FP32_QUOTIENT_LO_LIMIT)) {
final_quotient = fp32_div_slowpath (dividend, divisor);
}
return final_quotient;
}
```

I confirm on v100 that using the code provided by njuffa, the `float`

division throughput goes up substantially/similarly, by a factor of 1980512/1214144 = ~1.63. The recomputed throughput ratio would be 1214144/157758 = ~7.7

On K20Xm the div:div ratio is 15248483/10869354 = ~1.40, with a recomputed mult:div ratio of 10869354/881876 = ~12.3

I filed RFE 3259380 to have the CUDA math team look at this. (Thanks njuffa.)

To be clear: The code I posted is *not* what should be considered for inlining. The implementation of single-precision division that CUDA 11.1 currently uses is 3 instruction shorter than the code I posted. So what is desired is simply inlining the *currently existing* fastpath code for the single-precision division. Sizewise (on the order of 15 instructions) , this is in line (no pun intended!) with other primitives the compiler routinely inlines.

Looking through Agner Fog’s tables, the throughput ratio between `VMULPS`

and `VDIVPS`

is 10x for Skylake, SkylakeX, and Coffeelake, and 20x for the older Broadwell architecture. The throughput ratios on GPUs are reasonably close to this, so I do not anticipate a need to differentiate by processor platform in `__host__ __device__`

code.

Clear. The RFE simply asks to investigate inlining the fast path.