Techniques for quasi double precision computation using pairs of float operands are well known and well suited to GPUs. However, the effective precision achieved by this approach is somewhat lower than with true double computation: the pair of 24-bit float significands falls short of double's 53-bit significand. More importantly, one remains limited by the narrow dynamic range (exponent range) of float.
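To make the approach concrete, here is a minimal sketch of a float-pair addition (an illustrative helper of my own, not taken from any particular library); it represents a value as an unevaluated head/tail sum and relies on the compiler preserving the written order of operations:
__device__ float2 df_add (float2 a, float2 b)
{
float s = a.x + b.x;                    // approximate sum of the heads
float v = s - a.x;
float e = (a.x - (s - v)) + (b.x - v);  // rounding error of s (TwoSum)
e = e + a.y + b.y;                      // fold in the tails
float hi = s + e;                       // renormalize into head/tail form
float lo = e - (hi - s);
return make_float2 (hi, lo);
}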
Ever since NVIDIA introduced consumer GPUs that execute FP64 operations at 1/64 of the rate of FP32 operations, I have been wondering whether it could make sense to simply emulate proper IEEE-754 FP64 arithmetic using integer-based fixed-point computation. Back in the days of 200 MHz FPU-less 32-bit ARM processors, one could emulate an FP64 multiply in less than 60 cycles using manually crafted assembly code. The instruction set of 32-bit ARM processors is particularly well suited to such endeavors, presumably by design, as floating-point coprocessors were expensive add-ons when the architecture was first designed. The integer instructions provided by the GPU are not quite as sophisticated.
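For instance, the 64 x 64 -> 128-bit unsigned product needed for mantissa arithmetic is not available as a single operation. As a sketch (the helper umul64wide below is purely illustrative and simply wraps the standard __umul64hi() intrinsic), it has to be assembled from two multiplies; the division code further down uses exactly this idiom inline:
__device__ void umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
*lo = a * b;              // least significant 64 bits of the full product
*hi = __umul64hi (a, b);  // most significant 64 bits of the full product
}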
For this reason, replicating this feat with compiled CUDA code seemed a bit daunting, so I figured I would start my investigation with FP64 division, which CUDA implements using basic FP64 instructions as building blocks. I took some working FP64 division code I wrote for an answer on Stack Overflow some years ago and adapted it to the GPU. On my Turing-based Quadro RTX 4000, which has a 1/32 ratio of FP64 to FP32 throughput, this results in a speedup of more than 2x in a standalone benchmark. I built on a Windows platform with CUDA 12.3, using this command line:
nvcc -Xcompiler "/W4 /O2 /arch:AVX512 /favor:INTEL64 /fp:precise" -arch=sm_75 -o fp64_div_throughput.exe fp64_div_throughput.cu
Sample output is as follows; the first run is from a build with USE_BUILTIN_DIV set to 1, the second from a build with it set to 0:
C:\Users\Norbert\My Programs>fp64_div_throughput -d0
CUDA initialization: 0.268 seconds
fp64_div: testing built-in FP64 division
fp64_div: running on device 0 (Quadro RTX 4000, WDDM)
fp64_div: time = 639.177 msec; FP64 div throughput = 15.6451 Gop/sec
fp64_div: test PASSED
C:\Users\Norbert\My Programs>fp64_div_throughput -d0
CUDA initialization: 0.270 seconds
fp64_div: testing emulated FP64 division
fp64_div: running on device 0 (Quadro RTX 4000, WDDM)
fp64_div: time = 255.752 msec; FP64 div throughput = 39.1004 Gop/sec
fp64_div: test PASSED
Please note that this is research-level code; for industrial-strength implementations, FP64 division is notoriously hard to prove or test for full IEEE-754 compliance. One negative aspect of using integer-based emulation is increased register pressure; there is also a considerable expansion in code size. But the result demonstrates the general point that for modern consumer-grade GPUs, using integer-based emulation for all FP64 operations that are not directly supported by the hardware, in particular divisions, square roots, and simple transcendental functions, is an idea worth considering.
I am appending my full code below in case someone wants to try this on an Ampere- or Ada-based GPU.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#define USE_BUILTIN_DIV (0)
#define USE_HALLEY_ITERATION (1)
#define FP64_DIV_DEFAULT_LEN (100000000)
#define FP64_DIV_DEFAULT_DEV (0)
#define FP64_DIV_ITER (5)
#define THREADS_PER_BLOCK (128)
#define REPS (100)
// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call) \
do { \
cudaError_t err = call; \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
/* Check synchronous errors, i.e. pre-launch */ \
cudaError_t err = cudaGetLastError(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
/* Check asynchronous errors, i.e. kernel failed (ULF) */ \
err = cudaDeviceSynchronize(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString( err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
double uint64_as_double (uint64_t a)
{
double r;
memcpy (&r, &a, sizeof r);
return r;
}
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
#define FP64_MANT_BITS (52)
#define FP64_EXPO_BITS (11)
#define FP64_EXPO_MASK (0x7ff0000000000000ULL)
#define FP64_SIGN_MASK (0x8000000000000000ULL)
#define FP64_MANT_MASK (0x000fffffffffffffULL)
#define FP64_MAX_EXPO (0x7ff)
#define FP64_EXPO_BIAS (1023)
#define FP64_INFTY (0x7ff0000000000000ULL)
#define FP64_QNAN_BIT (0x0008000000000000ULL)
#define FP64_QNAN_INDEFINITE (0xfff8000000000000ULL)
#define FP64_MANT_INT_BIT (0x0010000000000000ULL)
__device__ uint64_t fp64_div_core (uint64_t x, uint64_t y)
{
uint64_t sign_r, abs_x, abs_y, f, l, p, r, z;
uint32_t expo_x, expo_y, expo_r, s, x_is_zero, y_is_zero, norm_shift;
expo_x = (x & FP64_EXPO_MASK) >> FP64_MANT_BITS;
expo_y = (y & FP64_EXPO_MASK) >> FP64_MANT_BITS;
sign_r = (x ^ y) & FP64_SIGN_MASK;
abs_x = x & ~FP64_SIGN_MASK;
abs_y = y & ~FP64_SIGN_MASK;
x_is_zero = (abs_x == 0);
y_is_zero = (abs_y == 0);
if ((expo_x == FP64_MAX_EXPO) || (expo_y == FP64_MAX_EXPO) ||
x_is_zero || y_is_zero) [[unlikely]] {
int x_is_nan = (abs_x > FP64_INFTY);
int x_is_inf = (abs_x == FP64_INFTY);
int y_is_nan = (abs_y > FP64_INFTY);
int y_is_inf = (abs_y == FP64_INFTY);
if (x_is_nan) {
r = x | FP64_QNAN_BIT;
} else if (y_is_nan) {
r = y | FP64_QNAN_BIT;
} else if ((x_is_zero && y_is_zero) || (x_is_inf && y_is_inf)) {
r = FP64_QNAN_INDEFINITE;
} else if (x_is_zero || y_is_inf) {
r = sign_r;
} else if (y_is_zero || x_is_inf) {
r = sign_r | FP64_INFTY;
}
} else [[likely]] {
/* normalize any subnormals */
if (expo_x == 0) [[unlikely]] {
s = __clzll (abs_x) - FP64_EXPO_BITS;
x = x << s;
expo_x = expo_x - (s - 1);
}
if (expo_y == 0) [[unlikely]] {
s = __clzll (abs_y) - FP64_EXPO_BITS;
y = y << s;
expo_y = expo_y - (s - 1);
}
//
expo_r = expo_x - expo_y + FP64_EXPO_BIAS;
/* extract mantissas */
x = x & FP64_MANT_MASK;
y = y & FP64_MANT_MASK;
/* compute initial approximation r ~= 1/y */
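/* the hardware's approximate-reciprocal instruction, applied to a re-biased
copy of y's mantissa, supplies the starting value, which is then converted
into the fixed-point format refined below */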
double tmp = __longlong_as_double ((0x3fcULL << FP64_MANT_BITS) | y);
asm ("rcp.approx.ftz.f64 %0,%0;" :"+d"(tmp));
r = (((uint64_t)__double_as_longlong(tmp)) << (FP64_EXPO_BITS - 1)) -
(0x400ULL << 32);
/* make implicit integer bit of mantissas explicit */
x = x | FP64_MANT_INT_BIT;
y = y | FP64_MANT_INT_BIT;
/* pre-scale y for more efficient fixed-point computation */
z = y << FP64_EXPO_BITS;
#if !(USE_HALLEY_ITERATION)
/* first NR iteration: r2 = r1*(2-y*r1) */
p = __umul64hi (z, r << 1);
f = 0u - p;
r = __umul64hi (f, r << 1);
/* second NR iteration: r3 = r2*(2-y*r2) */
p = __umul64hi (z, r << 1);
f = 0u - p;
r = __umul64hi (f, r << 1);
#else // USE_HALLEY_ITERATION
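/* Halley's method converges cubically, roughly tripling the number of
correct reciprocal bits per step, so the single step below stands in for
the two quadratically convergent Newton-Raphson steps above */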
/* single Halley iteration */
f = 0x8000000000000000ULL - __umul64hi (z, r << 1); // f = 1 - r * y
f = __umul64hi (f, f << 1) + f; // f = f * f + f
r = __umul64hi (f, r << 1) + r; // r = r + r * f
#endif // USE_HALLEY_ITERATION
/* compute quotient as wide product x*(1/y) = x*r */
l = x * (r << 1);
r = __umul64hi (x, r << 1);
/* normalize mantissa to be in [1,2) */
norm_shift = (r & FP64_MANT_INT_BIT) == 0;
expo_r -= norm_shift;
r = (r << norm_shift) | ((l >> 1) >> (64 - 1 - norm_shift));
if ((expo_r > 0) && (expo_r < FP64_MAX_EXPO)) [[likely]] { /* normal */
int64_t rem_raw, rem_inc;
int inc;
/* align x with product y*quotient */
x = x << (FP64_MANT_BITS + 1 + norm_shift);
/* compute product y*quotient */
y = y << 1;
p = y * r;
/* compute x - y*quotient, for both raw and incremented quotient */
rem_raw = x - p;
rem_inc = rem_raw - y;
/* round to nearest: tie case _cannot_ occur */
inc = llabs (rem_inc) < llabs (rem_raw);
/* build final result from sign, exponent, mantissa */
r = sign_r | ((((uint64_t)expo_r - 1) << FP64_MANT_BITS) + r + inc);
} else if ((int)expo_r >= FP64_MAX_EXPO) [[unlikely]] { /* overflow */
r = sign_r | FP64_INFTY;
} else [[unlikely]] { /* underflow */
int denorm_shift = 1 - expo_r;
if (denorm_shift > (FP64_MANT_BITS + 1)) [[unlikely]] { /* massive underflow */
r = sign_r;
} else [[likely]] {
int64_t rem_raw, rem_inc;
int inc;
/* denormalize quotient */
r = r >> denorm_shift;
/* align x with product y*quotient */
x = x << (FP64_MANT_BITS + 1 + norm_shift - denorm_shift);
/* compute product y*quotient */
y = y << 1;
p = y * r;
/* compute x - y*quotient, for both raw & incremented quotient*/
rem_raw = x - p;
rem_inc = rem_raw - y;
/* round to nearest or even: tie case _can_ occur */
inc = ((llabs (rem_inc) < llabs (rem_raw)) ||
(llabs (rem_inc) == llabs (rem_raw) && (r & 1)));
/* build final result from sign and mantissa */
r = sign_r | (r + inc);
}
}
}
return r;
}
__device__ double fp64_div (double a, double b)
{
long long int x, y, r;
x = __double_as_longlong (a);
y = __double_as_longlong (b);
r = fp64_div_core (x, y);
return __longlong_as_double (r);
}
__global__ void fp64_div_test (double dividend, double divisor,
double *dst, int len)
{
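/* each thread performs REPS back-to-back dependent divisions forming a
serial chain; throughput comes from running many such chains concurrently */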
int stride = gridDim.x * blockDim.x;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = tid; i < len; i += stride) {
divisor = __longlong_as_double (__double_as_longlong (divisor) + tid);
#pragma unroll 1
for (int j = 0; j < REPS; j++) {
#if USE_BUILTIN_DIV
dividend /= divisor;
#else // USE_BUILTIN_DIV
dividend = fp64_div (dividend, divisor);
#endif // USE_BUILTIN_DIV
}
dst[i] = dividend;
}
}
struct fp64_div_opts {
int len;
int dev;
};
static int processArgs (int argc, char *argv[], struct fp64_div_opts *opts)
{
int error = 0;
memset (opts, 0, sizeof(*opts));
while (argc) {
if (*argv[0] == '-') {
switch (*(argv[0]+1)) {
case 'n':
opts->len = atol(argv[0]+2);
break;
case 'd':
opts->dev = atol(argv[0]+2);
break;
default:
fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
error++;
break;
}
}
argc--;
argv++;
}
return error;
}
int main (int argc, char *argv[])
{
double init_start, init_stop, start, stop, mintime, elapsed;
double *quot = 0, *res = 0;
double dividend = 3.14159265358979, divisor = 0.999999999999999;
struct fp64_div_opts opts;
int errors;
errors = processArgs (argc, argv, &opts);
if (errors) {
return EXIT_FAILURE;
}
opts.len = (opts.len) ? opts.len : FP64_DIV_DEFAULT_LEN;
opts.dev = (opts.dev) ? opts.dev : FP64_DIV_DEFAULT_DEV;
/* Trigger CUDA context creation */
init_start = second();
CUDA_SAFE_CALL (cudaFree (0));
init_stop = second();
printf ("CUDA initialization: %.3f seconds\n", init_stop - init_start);
printf ("fp64_div: testing %s FP64 division\n",
USE_BUILTIN_DIV ? "built-in" : "emulated");
/* Select GPU to run on */
struct cudaDeviceProp props;
CUDA_SAFE_CALL (cudaSetDevice (opts.dev));
CUDA_SAFE_CALL (cudaGetDeviceProperties (&props, opts.dev));
printf ("fp64_div: running on device %d (%s", opts.dev, props.name);
#if defined (_WIN32)
printf (", %s)\n", props.tccDriver ? "TCC" : "WDDM");
#else
printf (")\n");
#endif
/* Allocate memory on host and device */
res = (double *) malloc (sizeof(res[0]) * opts.len);
CUDA_SAFE_CALL (cudaMalloc((void**)&quot, sizeof(quot[0]) * opts.len));
/* Compute execution configuration */
dim3 dimBlock(THREADS_PER_BLOCK);
int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
dim3 dimGrid(threadBlocks);
mintime = fabs(log(0.0));
for (int k = 0; k < FP64_DIV_ITER; k++) {
start = second();
fp64_div_test<<<dimGrid,dimBlock>>>(dividend, divisor, quot, opts.len);
CHECK_LAUNCH_ERROR();
stop = second();
elapsed = stop - start;
if (elapsed < mintime) mintime = elapsed;
}
printf ("fp64_div: time = %.3f msec; FP64 div throughput = %.4f Gop/sec\n",
1.0e3 * mintime, (opts.len / mintime) * REPS * 1e-9);
/* check correctness */
printf ("fp64_div: checking results ...");
CUDA_SAFE_CALL (cudaMemcpy (res, quot, opts.len*sizeof(res[0]), cudaMemcpyDeviceToHost));
for (int k = 0; k < opts.len; k++) {
double quotient = dividend;
double dvsr = uint64_as_double (double_as_uint64 (divisor) + k);
for (int j = 0; j < REPS; j++) {
quotient /= dvsr;
}
if (quotient != res[k]) {
printf ("!!!@k=%d: res=%23.13a ref=%23.13a\n", k, res[k], quotient);
return EXIT_FAILURE;
}
}
printf ("\rfp64_div: test PASSED \n");
CUDA_SAFE_CALL (cudaFree(quot));
free (res);
return EXIT_SUCCESS;
}