In my CUDA device code, I have to perform integer division. I wonder if it is possible to replace it with floating-point division. For example, replace the function
__device__ int div(int x, int y) {
return x / y;
}
by
__device__ int div(int x, int y) {
return (int)((float)x / (float)y);
}
The reason is that I learned from the NVIDIA forum that integer division needs hundreds of clock cycles, while floating-point division needs only about ten. Another reason is that my kernel faces high register pressure. I know that integer division expands into multiple instructions, and I suspect it may use a lot of registers. But I wonder whether the conversion between integer and float is efficient.
Some additional information: both x and y in my program are positive integers and are not very large (no more than 1e5), so I suspect single-precision floating point has enough precision. Also, y is not a compile-time constant, so the compiler cannot optimize the integer division. Finally, my kernel is compute-bound (not memory-bound) and the integer division sits inside a for loop, so I think it is important to make it faster.
I don’t think that reflects the situation accurately. I assume we are talking 32-bit integer division here? Which GPU architecture?
NVIDIA GPUs have no hardware support for division. All division operations are emulated, and there are multiple versions of integer division and multiple versions of floating-point division. The emulation sequences eat up instructions and registers for temporary variables.
If you want to know whether replacing integer division with floating-point division is beneficial to performance in your use case, I would suggest simply trying it. That should be a 5-minute experiment, given that there is a single division in the kernel that needs replacing.
As your range of possible divisors y is limited, you might create a table of inverses stored in fixed-point precision and replace the division with a table lookup followed by a multiplication. You will probably end up having to do 64-bit multiplications, but that is still fairly fast.
Alternatively, store the inverses as FP32 or FP64 constants and do the multiplication in floating point, casting back to int for the result.
The inverse table may fit into constant memory, which offers cached access.
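To make the fixed-point idea a bit more concrete, here is a rough sketch (untested; the names MAX_VAL, RECIP_SHIFT, d_recip_table, init_recip_table, and udiv_table are made up for the example). It stores m = ceil(2^40 / y) for every possible divisor and turns each division into one 64-bit multiply plus a shift; the 2^40 scale should be enough headroom for operands below 1e5. With 1e5 entries of 8 bytes the full table is too large for the 64 KB of constant memory, so the sketch keeps it in global memory; a small divisor set could go into a __constant__ array instead.

#define MAX_VAL     100000   // upper bound on x and y, per the question
#define RECIP_SHIFT 40       // fixed-point scale, chosen so the truncated result matches x / y for x, y < 1e5

// table of m[y] = ceil(2^RECIP_SHIFT / y), filled once from the host
__device__ unsigned long long d_recip_table[MAX_VAL];

// host-side setup, call once before launching kernels that use the table
void init_recip_table (void)
{
    static unsigned long long h_tab[MAX_VAL];
    h_tab[0] = 0; // divisor 0 is invalid and never looked up
    for (unsigned long long y = 1; y < MAX_VAL; y++) {
        h_tab[y] = ((1ULL << RECIP_SHIFT) + y - 1) / y; // ceil(2^40 / y)
    }
    cudaMemcpyToSymbol (d_recip_table, h_tab, sizeof (h_tab));
}

// x / y via table lookup and 64-bit multiply, for 0 <= x < 1e5 and 1 <= y < 1e5
__device__ unsigned int udiv_table (unsigned int x, unsigned int y)
{
    return (unsigned int)((x * d_recip_table[y]) >> RECIP_SHIFT);
}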
Here is a special division function for unsigned integers less than 100,000 called udiv_lt_100000, together with the test code that checks its correctness exhaustively. I added a define USE_OWN_CVT to turn on float/int and int/float conversions that bypass the GPU conversion instructions. Whether that is faster likely depends on GPU architecture.
#include <stdio.h>
#include <stdlib.h>

#define LEN     (100000)
#define THREADS (128)
#define USE_OWN_CVT (0)

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
__forceinline__ __device__ float rcp_approx_gpu (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* integer division for dividend and divisor < 100,000 */
__device__ unsigned int udiv_lt_100000 (unsigned int x, unsigned int y)
{
    const float magic = 1.2582912e+7f; // 0x1.8p+23
    const unsigned int magic_i = __float_as_int (magic);
#if USE_OWN_CVT
    float divisor  = __int_as_float (magic_i | y) - magic;
    float dividend = __int_as_float (magic_i | x) - magic;
#else // USE_OWN_CVT
    float divisor  = (float)y;
    float dividend = (float)x;
#endif // USE_OWN_CVT
    float t = rcp_approx_gpu (divisor);
    t = __int_as_float (__float_as_int (t) + 1); // bump reciprocal up so quotient is never too small
    t = t * dividend;
#if USE_OWN_CVT
    t = __fadd_rz (t, magic);
    unsigned int q = __float_as_int (t) & 0x1ffff;
#else // USE_OWN_CVT
    unsigned int q = (unsigned int)t;
#endif // USE_OWN_CVT
    return q;
}
__global__ void divtest (unsigned int *result, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        unsigned int dividend = i;
        unsigned int flag = 0;
        for (unsigned int divisor = 1; divisor < LEN; divisor++) {
            unsigned int ref = dividend / divisor;
            unsigned int res = udiv_lt_100000 (dividend, divisor);
            flag += res != ref;
        }
        result [i] = flag;
    }
}

int main (void)
{
    dim3 dimBlock(THREADS);
    int threadBlocks = (LEN + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    unsigned int *r = 0, *d_r = 0;

    r = (unsigned int *) malloc (sizeof (r[0]) * LEN);
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_r, sizeof(d_r[0]) * LEN));
    CUDA_SAFE_CALL (cudaMemset(d_r, 0xff, sizeof(d_r[0]) * LEN)); // all ones
    divtest<<<dimGrid,dimBlock>>>(d_r, LEN);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaMemcpy (r, d_r, sizeof (r[0]) * LEN, cudaMemcpyDeviceToHost));
    for (int i = 0; i < LEN; i++) {
        if (r[i] == 0xffffffff) {
            printf ("counter not written i=%d\n", i);
            return EXIT_FAILURE;
        }
        if (r[i] != 0) {
            printf ("division failures i=%d r=%u\n", i, r[i]);
            return EXIT_FAILURE;
        }
    }
    CUDA_SAFE_CALL (cudaFree (d_r));
    free (r);
    return EXIT_SUCCESS;
}
Thanks a lot! Yes, as written in the two functions, the type is int (32-bit). The GPU is an RTX 3090 (Ampere). It seems that using floating-point division indeed reduces the register usage.
Generally an unsigned integer division will be slightly faster than a signed integer division of the same width. There is additional overhead to get the sign handling correct.
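Since your x and y are positive, the cheapest experiment is to route the existing division through unsigned types so the compiler emits the unsigned sequence; a minimal sketch (the function name is made up):

__device__ int div_pos (int x, int y)
{
    // x and y are known to be positive, so the cheaper unsigned division
    // sequence can be used; the result is unchanged for x, y > 0
    return (int)((unsigned int)x / (unsigned int)y);
}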
For Ampere, the unsigned 32-bit integer division seems to be an inlined sequence of 17 instructions of which 7 are IMAD. This seems to eat up 9 registers (versus the 3 that would be required if there were a hardware division instruction). The fast path of the float division is a called subroutine of ten instructions, plus CAL/RET makes 12 instructions. It appears to use 7 registers. Using it for integer division adds three more conversion instructions for a total of 15.
Therefore, on Ampere, using the floating-point division improves throughput by an estimated 25% and saves two registers compared with the unsigned int division. My special udiv_lt_100000() division routine requires ten inlined instructions when compiled with USE_OWN_CVT = 1 and needs only 5 registers. It should double the throughput compared to the built-in unsigned int division (estimate) and reduce register usage by 4.
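If you want to check the register usage of your own kernel, nvcc -Xptxas -v reports it at compile time, and you can also query it at run time through the runtime API. A minimal sketch, using the divtest kernel from the listing above as a stand-in for your kernel:

// print how many registers per thread the compiler assigned to a kernel
void report_regs (void)
{
    cudaFuncAttributes attr;
    CUDA_SAFE_CALL (cudaFuncGetAttributes (&attr, (const void*)divtest));
    printf ("divtest uses %d registers per thread\n", attr.numRegs);
}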
Thanks so much! Now I have a clear picture about the speed of different situations.
I just figured out one additional thing: for my kernel, the divisor in the for loop stays the same (although it differs between kernel launches). So in this case, is there any trick that can speed up the calculation further? (For example, @cbuchner1 mentioned that we can pre-calculate the inverse of the current divisor, but I am not clear on how that compares with the floating-point operations in your implementation.) Thank you again!
If there is more than one instance of the same divisor used in the kernel, the compiler should be able to exploit that through common subexpression elimination (CSE), provided the division code is inlined. Since 32-bit integer division is inlined code for Ampere from what I can see, you should be good. What, if any, optimizations the compiler can perform across a function boundary in the case of the ‘float’ subroutine, I do not know.
Computing the reciprocal will use the MUFU.RCP instruction no matter whether it is an integer or a floating-point division. It is basically a table lookup plus interpolation on an internal ROM table, so my expectation is that any manual table lookup to replace it would provide only a small improvement. The easiest way to find out is to implement variants and then profile.
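For example, a hoisted-reciprocal variant could look like the sketch below (my illustration; the kernel and array names are made up). It reuses the one-ulp bump from udiv_lt_100000 so that the truncated result should still match the integer quotient for your value ranges, but it is worth verifying exhaustively, as in the test program above, and profiling against the plain division.

// divisor is the same for every loop iteration, so compute its (slightly
// bumped) reciprocal once per thread and replace each division by a multiply;
// assumes 0 < vals[i] < 1e5 and 0 < divisor < 1e5, as stated in the question
__global__ void scale_by_divisor (const int *vals, int *quot, int n, int divisor)
{
    float r = __int_as_float (__float_as_int (1.0f / (float)divisor) + 1);
    int stride = gridDim.x * blockDim.x;
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n; i += stride) {
        quot[i] = (int)((float)vals[i] * r);  // was: vals[i] / divisor
    }
}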
If you are willing to learn to read assembly language, you can use cuobjdump --dump-sass to see what the compiler transforms your code into. Constant propagation and CSE are some of the bread & butter optimizations the CUDA compiler will pursue aggressively.
I would have thought that for a reused floating point divisor, computing the reciprocal and then converting the divisions to multiplications would be an obvious win. (…hasty benchmarking…) But sure enough, it isn’t. Nice work, compiler!
The way the CUDA toolchain has matured, it is usually a good idea to write CUDA code in a straightforward, natural manner. Only after finding proof that the compiler is doing something sub-optimal should one engage in manual source code re-writing. What compilers in general cannot do is optimize based on numeric range limitations (as is the case here). In most programming languages such constraints cannot even be expressed.
The mature CUDA compiler is capable of transforming source code so extensively that on more than one occasion I have had serious trouble following the resulting SASS and identifying bits and pieces of the computation I specified at HLL level, even though I have lots of experience reading SASS.