Speed of division compared to other arithmetic operations, perhaps in terms of clock cycles

I want to get an idea of how much slower division of some general types like double or float is compared to +, -, *. I've been able to find this https://www.agner.org/optimize/instruction_tables.pdf for the host; I'm looking for something similar for CUDA hardware. I realize this will depend on the generation (Kepler will be different from Lovelace) and on the type (double, float, int, 32-bit vs. 64-bit, …), but I think it should be possible to give some guesstimated comparison. It seems that on the host, division is actually not much slower than multiplication. Is the same true for CUDA? I'm not looking for exact numbers; something like 1x, 10x, … relative to multiplication would be fine.

On a (Volta) V100, I would estimate the throughput ratio (multiplication:division) for float at 1980159/157758 = ~12.55 and for double at 2403962/310198 = ~7.75

$ cat t1837.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>

const int S = 32768;
template <typename T>
__global__ void k(T * __restrict__ d, const int N, const T denom){

  // grid-stride loop over the data
  for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < N; i += gridDim.x*blockDim.x){
    T my_d = d[i];
    for (int j = 0; j < S; j++)   // apply the operation S times per element
#ifdef USE_DIV
      my_d /= denom;
#else
      my_d *= denom;
#endif

    d[i] = my_d;}
}

#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}


#ifndef USE_DOUBLE
typedef float mt;
#else
typedef double mt;
#endif

const int my_N = 1048576*32;
int main(){

  mt *d, *h_d;
  cudaMalloc(&d, my_N*sizeof(d[0]));
  h_d = new mt[my_N];
  for (int i = 0; i < my_N; i++) h_d[i] = 1.01;
  mt num = 1.001;
  k<<<160,1024>>>(d, my_N, num); // warm-up
  cudaMemcpy(d, h_d, my_N*sizeof(d[0]), cudaMemcpyHostToDevice);
  unsigned long long dt = dtime_usec(0);
  k<<<160,1024>>>(d, my_N, num);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl; return 0;}
  std::cout << "elapsed time: "  << dt << "us" << std::endl;
  return 0;
}

$ nvcc -arch=sm_70 t1837.cu -o t1837
$ ./t1837
elapsed time: 157758us
$ nvcc -arch=sm_70 t1837.cu -o t1837 -DUSE_DIV
$ ./t1837
elapsed time: 1980159us
$ nvcc -arch=sm_70 t1837.cu -o t1837 -DUSE_DOUBLE
$ ./t1837
elapsed time: 310198us
$ nvcc -arch=sm_70 t1837.cu -o t1837 -DUSE_DOUBLE -DUSE_DIV
$ ./t1837
elapsed time: 2403962us
$
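Side note on scale (my own arithmetic, so take it with a grain of salt): each timed launch performs my_N x S = 33554432 x 32768 = 2^40 ≈ 1.1 x 10^12 operations, so the 157758us float-multiply case works out to roughly 7 x 10^12 FMULs per second. That is close to the V100's peak FP32 issue rate, which suggests the kernel is compute-bound and the ratios above are meaningful.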

On a (Kepler) K20Xm, I would estimate the throughput ratio (multiplication:division) for float at 15248483/881876 = ~17.29 and for double at 17869279/1772434 = ~10.08

$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets
$ CUDA_VISIBLE_DEVICES="1" ./t1837
elapsed time: 881876us
$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DIV
$ CUDA_VISIBLE_DEVICES="1" ./t1837                             
elapsed time: 15248483us
$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DOUBLE
$ CUDA_VISIBLE_DEVICES="1" ./t1837                             
elapsed time: 1772434us
$ nvcc -arch=sm_35 t1837.cu -o t1837 -Wno-deprecated-gpu-targets -DUSE_DOUBLE -DUSE_DIV
$ CUDA_VISIBLE_DEVICES="1" ./t1837                             
elapsed time: 17869279us

I am puzzled. It does not make sense that the ratios would be smaller for double and larger for float. For both types, division is implemented via an iterative process that starts with a hardware approximation to the reciprocal of roughly the same accuracy (for technical reasons, slightly less accurate in the double case, actually). It therefore stands to reason that the double computation should require one additional iteration, and that its cost, measured in multiples of the cost of a multiply, should be higher.
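For concreteness, here is a sketch of the kind of refinement step I mean (the standard textbook Newton-Raphson iteration for the reciprocal; not necessarily NVIDIA's exact emulation sequence):

// One Newton-Raphson refinement step for an approximation r ≈ 1/d.
// Each step roughly doubles the number of correct bits, which is why one
// would expect the double-precision division to need one more step than float.
__device__ float refine_recip (float r, float d)
{
    float e = fmaf (-d, r, 1.0f);   // residual error: 1 - d*r
    return fmaf (r, e, r);          // refined approximation: r + r*e
}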

I wonder whether the float computation hits the slow path due to overflow or underflow? An alternative hypothesis would be that the float division code is less optimized. Or maybe there is some third explanation that I am currently failing to take into account, because it has been many years since I last looked at the emulation sequences in detail.

[Later:]

I built my own test framework independently, and the resulting data matches up quite well with Robert Crovella’s experiments. On my Quadro RTX 4000 (Turing), I find the following throughput:

single precision, throughput ratio of ~14.7x between multiplies and divides:
4021.3 x 10^9 FMULs per second
273.6 x 10^9 FDIVs per second

double precision, throughput ratio of 8.8x between multiplies and divides:
112.6 x 10^9 DMULs per second
12.8 x 10^9 DDIVs per second

The multiplication throughput matches what one would expect based on published specifications (within noise caused by dynamic clocking), giving me some confidence that the framework works correctly.

On further reflection, the initially surprising lower efficiency of the single-precision division relative to multiplies may be caused by function call overhead and lower SFU throughput (for the starting approximation of the reciprocal) relative to FP32 operations.

[Even later]:

Call overhead indeed seems to be the reason for the comparatively slow single-precision division. I performed an experiment where I replaced CUDA's built-in division with my own function fp32_div(), which inlines the fastpath of the division algorithm and calls the slowpath only when necessary. With this change I get 461.1 x 10^9 single-precision divisions per second on my Quadro RTX 4000, a 68% increase in throughput.

I am showing my code below. While I am reasonably confident that this delivers correctly rounded results, I have not rigorously proven that. Note that I cannot make this code fully efficient, as there does not seem to be a way to get at the FCHK instruction from CUDA. The fastpath of the single-precision division is small enough that it should be inlined, something NVIDIA might want to consider. Current x86 CPUs are certainly no slouch when it comes to single-precision division throughput. At a throughput of one VDIVPS every 5 cycles per Agner Fog's table, the HEDT Xeon W-3175X can crunch through 139 x 10^9 per second (28 cores x 3.1 GHz base clock x 8 lanes per 256-bit VDIVPS / 5 cycles), if I didn't make any mistakes in my arithmetic.

#include <cstdint>   // for uint32_t

__noinline__ __device__ float fp32_div_slowpath (float dividend, float divisor)
{ 
    return dividend / divisor; 
}

/* Subnormal arguments and results are treated as zero */
__forceinline__ __device__ float rcp_approx_gpu (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

__forceinline__ __device__ float fp32_div (float dividend, float divisor)
{
    const uint32_t FP32_MANT_MASK = 0x007fffffu;
    const uint32_t FP32_ONE = 0x3f800000u;
    const uint32_t FP32_SIGN_MASK = 0x80000000u;
    const uint32_t FP32_SIGN_EXPO_MASK = 0xff800000u;
    const uint32_t FP32_QUOTIENT_HI_LIMIT = 0x7f7fffffu; // 0x1.0p+128 - ulp
    const uint32_t FP32_QUOTIENT_LO_LIMIT = 0x00800001u; // 0x1.0p-126 + ulp

    // fast path
    float recip_approx = rcp_approx_gpu (divisor);
    float recip_err = fmaf (recip_approx, -divisor, 1.0f);
    float refined_recip = fmaf (recip_approx, recip_err, recip_approx);
    float dividend_mant = __int_as_float ((__float_as_int (dividend) & FP32_MANT_MASK) | FP32_ONE);
    float dividend_scale = __int_as_float (__float_as_int (dividend) & FP32_SIGN_EXPO_MASK);
    float refined_quotient_mant = refined_recip * dividend_mant;
    float refined_quotient_residual = fmaf (refined_quotient_mant, -divisor, dividend_mant);
    float final_quotient_mant = fmaf (refined_recip, refined_quotient_residual, refined_quotient_mant);
    float final_quotient = final_quotient_mant * dividend_scale;

    // check if we need to apply the slow path and invoke it if that is the case
    uint32_t iq = __float_as_int (final_quotient) & ~FP32_SIGN_MASK;
    if ((iq - FP32_QUOTIENT_LO_LIMIT) > (FP32_QUOTIENT_HI_LIMIT - FP32_QUOTIENT_LO_LIMIT)) {
        final_quotient = fp32_div_slowpath (dividend, divisor);
    }
    return final_quotient;
}
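As a usage sketch (assuming the kernel structure and the constant S from t1837.cu above), the USE_DIV branch of the benchmark simply becomes a call to fp32_div:

// Hypothetical variant of the benchmark kernel that exercises fp32_div;
// assumes fp32_div, fp32_div_slowpath, and S are visible in the same file.
__global__ void k_div (float * __restrict__ d, const int N, const float denom)
{
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < N; i += gridDim.x * blockDim.x) {
        float my_d = d[i];
        for (int j = 0; j < S; j++)
            my_d = fp32_div (my_d, denom);
        d[i] = my_d;
    }
}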


I confirm on a V100 that, using the code provided by njuffa, the float division throughput goes up substantially, similarly to njuffa's observation: by a factor of 1980512/1214144 = ~1.63. The recomputed multiplication:division throughput ratio would be 1214144/157758 = ~7.7.

On the K20Xm, the corresponding division speedup (built-in divide time : fp32_div time) is 15248483/10869354 = ~1.40, with a recomputed multiplication:division ratio of 10869354/881876 = ~12.3.

I filed RFE 3259380 to have the CUDA math team look at this. (Thanks njuffa.)


To be clear: the code I posted is not what should be considered for inlining. The implementation of single-precision division that CUDA 11.1 currently uses is 3 instructions shorter than the code I posted. So what is desired is simply inlining the currently existing fastpath code for single-precision division. Sizewise (on the order of 15 instructions), this is in line (no pun intended!) with other primitives the compiler routinely inlines.

Looking through Agner Fog’s tables, the throughput ratio between VMULPS and VDIVPS is 10x for Skylake, SkylakeX, and Coffeelake, and 20x for the older Broadwell architecture. The throughput ratios on GPUs are reasonably close to this, so I do not anticipate a need to differentiate by processor platform in __host__ __device__ code.

Clear. The RFE simply asks to investigate inlining the fast path.

When disassembling some code produced by CUDA 11.8 today, I noticed that the optimization of inlining the fastpath of single-precision division is now in place. It probably was added in some earlier version of CUDA 11, but I did not check those.

__global__ void kernel (float a, float b, float *r)
{
    *r = a / b;
}

Compiling the above for -arch=sm_75 with the CUDA 11.8 toolchain results in the following machine code (SASS), as disassembled by cuobjdump --dump-sass:

/*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;      /* 0x00000a00ff017624 */
/*0010*/                   MUFU.RCP R0, c[0x0][0x164] ;                 /* 0x0000590000007b08 */
/*0020*/                   ULDC UR4, c[0x0][0x160] ;                    /* 0x0000580000047ab9 */
/*0030*/                   IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x164] ;     /* 0x00005900ff057624 */
/*0040*/                   IMAD.U32 R4, RZ, RZ, UR4 ;                   /* 0x00000004ff047e24 */
/*0050*/                   FCHK P0, R4, c[0x0][0x164] ;                 /* 0x0000590004007b02 */
/*0060*/                   FFMA R3, R0, -R5, 1 ;                        /* 0x3f80000000037423 */
/*0070*/                   FFMA R3, R0, R3, R0 ;                        /* 0x0000000300037223 */
/*0080*/                   FFMA R0, R3, c[0x0][0x160], RZ ;             /* 0x0000580003007a23 */
/*0090*/                   FFMA R2, R0, -R5, c[0x0][0x160] ;            /* 0x0000580000027623 */
/*00a0*/                   FFMA R0, R3, R2, R0 ;                        /* 0x0000000203007223 */
/*00b0*/              @!P0 BRA 0xe0 ;                                   /* 0x0000002000008947 */
/*00c0*/                   MOV R8, 0xe0 ;                               /* 0x000000e000087802 */
/*00d0*/                   CALL.REL.NOINC 0x110 ;                       /* 0x0000003000007944 */ slowpath of FP32 division
/*00e0*/                   ULDC.64 UR4, c[0x0][0x168] ;                 /* 0x00005a0000047ab9 */
/*00f0*/                   STG.E.SYS [UR4], R0 ;                        /* 0x00000000ff007986 */
/*0100*/                   EXIT ;                                       /* 0x000000000000794d */
[...]

From a cursory glance, the inlined fastpath of the FP32 division looks about as tight as possible.


The following is purely speculative. Based on the disassembly, the FCHK instruction is clearly responsible for deciding whether the fastpath of the FP32 division is sufficient, or whether the slowpath needs to be invoked. It would be interesting to have some idea for which operand pairs the fastpath of FP32 division is invoked, and for which division is therefore fast. In line with the historical pattern, NVIDIA does not document the detailed operation of the FCHK instruction.

I did not want to spend the effort to reverse engineer the exact behavior of the FCHK instruction. Instead, I made some rough estimates based on the fastpath algorithm itself, which yielded the following hypothetical implementation:

/* returns 1 if the slowpath of fp32 division is required; 0 otherwise */
__forceinline__ __device__ int my_fchk (float dividend, float divisor)
{    
    int expo_divd = ((__float_as_int (dividend) & 0x7f800000) >> 23) - 127;
    int expo_divr = ((__float_as_int (divisor)  & 0x7f800000) >> 23) - 127;
    int expo_quot = expo_divd - expo_divr;
    return ((expo_quot < -125) || (expo_quot > 127) || 
            (expo_divr < -126) || (expo_divr > 125) ||
            (expo_divd < -102) || (expo_divd > 127));
}
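As a toy illustration (purely for experimentation, not anything from the CUDA headers), one could probe which operand pairs this hypothetical my_fchk would send to the slow path:

// Probe kernel: for each (dividend, divisor) pair, record whether the
// hypothetical my_fchk above predicts that the slow path is required.
__global__ void fchk_probe (const float2 * __restrict__ pairs, int * __restrict__ needs_slowpath, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        needs_slowpath[i] = my_fchk (pairs[i].x, pairs[i].y);
}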

Hello, I have a question and hope to get an answer. When using the fast-path calculation, can we get a correct result by using separate multiplication and addition instead of the FMA instruction? If FMA instructions are not supported, can we implement floating-point division using the Newton iteration method? Can we use Dekker's algorithm to achieve higher-precision multiplication and addition instead of FMA? I tested the fast path using the numerical range you provided and found that the results were correct even without using the FMA instruction. Of course, this only covered a portion of the numbers in that range.
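For reference, by Dekker's algorithm I mean the classic Veltkamp-split product; a rough sketch of the textbook binary32 form is below (just my own illustration, assuming round-to-nearest and no overflow during splitting, and not a claim that it suffices for correctly rounded division):

// Dekker/Veltkamp split product: computes p and e such that p + e == a * b
// exactly, using only multiplies and adds (no FMA).
__device__ void dekker_two_prod (float a, float b, float *p, float *e)
{
    const float SPLIT = 4097.0f;                    // 2^12 + 1 for binary32
    float ca = SPLIT * a;
    float cb = SPLIT * b;
    float a_hi = ca - (ca - a), a_lo = a - a_hi;    // split a into high/low halves
    float b_hi = cb - (cb - b), b_lo = b - b_hi;    // split b into high/low halves
    *p = a * b;                                     // rounded product
    *e = ((a_hi * b_hi - *p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo;  // rounding error
}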