Techniques for quasi double precision computation using pairs of float operands are well known and well suited to GPUs. However, the effective precision achieved by this approach is somewhat lower than with true double computation: the pair of 24-bit float significands falls short of double's 53-bit significand. More importantly, one remains limited by the narrow dynamic range (exponent range) of float.
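To make the approach concrete, here is a minimal sketch of a float-pair addition (an illustrative helper of my own, not taken from any particular library); it represents a value as an unevaluated head/tail sum and relies on the compiler preserving the written order of operations:
__device__ float2 df_add (float2 a, float2 b)
{
float s = a.x + b.x;                    // approximate sum of the heads
float v = s - a.x;
float e = (a.x - (s - v)) + (b.x - v);  // rounding error of s (TwoSum)
e = e + a.y + b.y;                      // fold in the tails
float hi = s + e;                       // renormalize into head/tail form
float lo = e - (hi - s);
return make_float2 (hi, lo);
}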
Ever since NVIDIA introduced consumer GPUs that execute FP64 operations at 1/64 of the rate of FP32 operations, I have been wondering whether it could make sense to simply emulate proper IEEE-754 FP64 arithmetic using integer-based fixed-point computation. Back in the days of 200 MHz FPU-less 32-bit ARM processors, one could emulate an FP64 multiply in less than 60 cycles using manually crafted assembly code. The instruction set of 32-bit ARM processors is particularly well suited to such endeavors, presumably by design, as floating-point coprocessors were expensive add-ons when the architecture was first designed. The integer instructions provided by the GPU are not quite as sophisticated.
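For instance, the 64 x 64 -> 128-bit unsigned product needed for mantissa arithmetic is not available as a single operation. As a sketch (the helper umul64wide below is purely illustrative and simply wraps the standard __umul64hi() intrinsic), it has to be assembled from two multiplies; the division code further down uses exactly this idiom inline:
__device__ void umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
*lo = a * b;              // least significant 64 bits of the full product
*hi = __umul64hi (a, b);  // most significant 64 bits of the full product
}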
For this reason, replicating this feat with compiled CUDA code seemed a bit daunting, so I figured I would start my investigation with FP64 division, which CUDA implements using basic FP64 instructions as building blocks. I took some working FP64 division code I wrote for an answer on Stack Overflow some years ago and adapted it to the GPU. On my Turing-based Quadro RTX 4000, which has a 1/32 ratio of FP64 to FP32 throughput, this results in a speedup of more than 2x in a standalone benchmark. I built on a Windows platform with CUDA 12.3, using this command line:
nvcc -Xcompiler "/W4 /O2 /arch:AVX512 /favor:INTEL64 /fp:precise" -arch=sm_75 -o fp64_div_throughput.exe fp64_div_throughput.cu
Sample output is as follows; the first run is from a build with USE_BUILTIN_DIV set to 1, the second from a build with it set to 0:
C:\Users\Norbert\My Programs>fp64_div_throughput -d0
CUDA initialization: 0.268 seconds
fp64_div: testing built-in FP64 division
fp64_div: running on device 0 (Quadro RTX 4000, WDDM)
fp64_div: time = 639.177 msec; FP64 div throughput = 15.6451 Gop/sec
fp64_div: test PASSED
C:\Users\Norbert\My Programs>fp64_div_throughput -d0
CUDA initialization: 0.270 seconds
fp64_div: testing emulated FP64 division
fp64_div: running on device 0 (Quadro RTX 4000, WDDM)
fp64_div: time = 255.752 msec; FP64 div throughput = 39.1004 Gop/sec
fp64_div: test PASSED
Please note that this is research-level code; for industrial-strength implementations, FP64 division is notoriously hard to prove or test for full IEEE-754 compliance. One negative aspect of using integer-based emulation is increased register pressure; there is also a considerable expansion in code size. But the result demonstrates the general point that for modern consumer-grade GPUs, using integer-based emulation for all FP64 operations that are not directly supported by the hardware, in particular divisions, square roots, and simple transcendental functions, is an idea worth considering.
I am appending my full code below in case someone wants to try this on an Ampere- or Ada-based GPU.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#define USE_BUILTIN_DIV (0)
#define USE_HALLEY_ITERATION (1)
#define FP64_DIV_DEFAULT_LEN (100000000)
#define FP64_DIV_DEFAULT_DEV (0)
#define FP64_DIV_ITER (5)
#define THREADS_PER_BLOCK (128)
#define REPS (100)
// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call) \
do { \
cudaError_t err = call; \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
/* Check synchronous errors, i.e. pre-launch */ \
cudaError_t err = cudaGetLastError(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
/* Check asynchronous errors, i.e. kernel failed (ULF) */ \
err = cudaDeviceSynchronize(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString( err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
double uint64_as_double (uint64_t a)
{
double r;
memcpy (&r, &a, sizeof r);
return r;
}
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
#define FP64_MANT_BITS (52)
#define FP64_EXPO_BITS (11)
#define FP64_EXPO_MASK (0x7ff0000000000000ULL)
#define FP64_SIGN_MASK (0x8000000000000000ULL)
#define FP64_MANT_MASK (0x000fffffffffffffULL)
#define FP64_MAX_EXPO (0x7ff)
#define FP64_EXPO_BIAS (1023)
#define FP64_INFTY (0x7ff0000000000000ULL)
#define FP64_QNAN_BIT (0x0008000000000000ULL)
#define FP64_QNAN_INDEFINITE (0xfff8000000000000ULL)
#define FP64_MANT_INT_BIT (0x0010000000000000ULL)
__device__ uint64_t fp64_div_core (uint64_t x, uint64_t y)
{
uint64_t sign_r, abs_x, abs_y, f, l, p, r, z;
uint32_t expo_x, expo_y, expo_r, s, x_is_zero, y_is_zero, norm_shift;
expo_x = (x & FP64_EXPO_MASK) >> FP64_MANT_BITS;
expo_y = (y & FP64_EXPO_MASK) >> FP64_MANT_BITS;
sign_r = (x ^ y) & FP64_SIGN_MASK;
abs_x = x & ~FP64_SIGN_MASK;
abs_y = y & ~FP64_SIGN_MASK;
x_is_zero = (abs_x == 0);
y_is_zero = (abs_y == 0);
if ((expo_x == FP64_MAX_EXPO) || (expo_y == FP64_MAX_EXPO) ||
x_is_zero || y_is_zero) [[unlikely]] {
int x_is_nan = (abs_x > FP64_INFTY);
int x_is_inf = (abs_x == FP64_INFTY);
int y_is_nan = (abs_y > FP64_INFTY);
int y_is_inf = (abs_y == FP64_INFTY);
if (x_is_nan) {
r = x | FP64_QNAN_BIT;
} else if (y_is_nan) {
r = y | FP64_QNAN_BIT;
} else if ((x_is_zero && y_is_zero) || (x_is_inf && y_is_inf)) {
r = FP64_QNAN_INDEFINITE;
} else if (x_is_zero || y_is_inf) {
r = sign_r;
} else if (y_is_zero || x_is_inf) {
r = sign_r | FP64_INFTY;
}
} else [[likely]] {
/* normalize any subnormals */
if (expo_x == 0) [[unlikely]] {
s = __clzll (abs_x) - FP64_EXPO_BITS;
x = x << s;
expo_x = expo_x - (s - 1);
}
if (expo_y == 0) [[unlikely]] {
s = __clzll (abs_y) - FP64_EXPO_BITS;
y = y << s;
expo_y = expo_y - (s - 1);
}
//
expo_r = expo_x - expo_y + FP64_EXPO_BIAS;
/* extract mantissas */
x = x & FP64_MANT_MASK;
y = y & FP64_MANT_MASK;
/* compute initial approximation r ~= 1/y */
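/* the hardware's approximate-reciprocal instruction, applied to a re-biased
copy of y's mantissa, supplies the starting value, which is then converted
into the fixed-point format refined below */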
double tmp = __longlong_as_double ((0x3fcULL << FP64_MANT_BITS) | y);
asm ("rcp.approx.ftz.f64 %0,%0;" :"+d"(tmp));
r = (((uint64_t)__double_as_longlong(tmp)) << (FP64_EXPO_BITS - 1)) -
(0x400ULL << 32);
/* make implicit integer bit of mantissas explicit */
x = x | FP64_MANT_INT_BIT;
y = y | FP64_MANT_INT_BIT;
/* pre-scale y for more efficient fixed-point computation */
z = y << FP64_EXPO_BITS;
#if !(USE_HALLEY_ITERATION)
/* first NR iteration: r2 = r1*(2-y*r1) */
p = __umul64hi (z, r << 1);
f = 0u - p;
r = __umul64hi (f, r << 1);
/* second NR iteration: r3 = r2*(2-y*r2) */
p = __umul64hi (z, r << 1);
f = 0u - p;
r = __umul64hi (f, r << 1);
#else // USE_HALLEY_ITERATION
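/* Halley's method converges cubically, roughly tripling the number of
correct reciprocal bits per step, so the single step below stands in for
the two quadratically convergent Newton-Raphson steps above */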
/* single Halley iteration */
f = 0x8000000000000000ULL - __umul64hi (z, r << 1); // f = 1 - r * y
f = __umul64hi (f, f << 1) + f; // f = f * f + f
r = __umul64hi (f, r << 1) + r; // r = r + r * f
#endif // USE_HALLEY_ITERATION
/* compute quotient as wide product x*(1/y) = x*r */
l = x * (r << 1);
r = __umul64hi (x, r << 1);
/* normalize mantissa to be in [1,2) */
norm_shift = (r & FP64_MANT_INT_BIT) == 0;
expo_r -= norm_shift;
r = (r << norm_shift) | ((l >> 1) >> (64 - 1 - norm_shift));
if ((expo_r > 0) && (expo_r < FP64_MAX_EXPO)) [[likely]] { /* normal */
int64_t rem_raw, rem_inc;
int inc;
/* align x with product y*quotient */
x = x << (FP64_MANT_BITS + 1 + norm_shift);
/* compute product y*quotient */
y = y << 1;
p = y * r;
/* compute x - y*quotient, for both raw and incremented quotient */
rem_raw = x - p;
rem_inc = rem_raw - y;
/* round to nearest: tie case _cannot_ occur */
inc = llabs (rem_inc) < llabs (rem_raw);
/* build final result from sign, exponent, mantissa */
r = sign_r | ((((uint64_t)expo_r - 1) << FP64_MANT_BITS) + r + inc);
} else if ((int)expo_r >= FP64_MAX_EXPO) [[unlikely]] { /* overflow */
r = sign_r | FP64_INFTY;
} else [[unlikely]] { /* underflow */
int denorm_shift = 1 - expo_r;
if (denorm_shift > (FP64_MANT_BITS + 1)) [[unlikely]] { /* massive underflow */
r = sign_r;
} else [[likely]] {
int64_t rem_raw, rem_inc;
int inc;
/* denormalize quotient */
r = r >> denorm_shift;
/* align x with product y*quotient */
x = x << (FP64_MANT_BITS + 1 + norm_shift - denorm_shift);
/* compute product y*quotient */
y = y << 1;
p = y * r;
/* compute x - y*quotient, for both raw & incremented quotient*/
rem_raw = x - p;
rem_inc = rem_raw - y;
/* round to nearest or even: tie case _can_ occur */
inc = ((llabs (rem_inc) < llabs (rem_raw)) ||
(llabs (rem_inc) == llabs (rem_raw) && (r & 1)));
/* build final result from sign and mantissa */
r = sign_r | (r + inc);
}
}
}
return r;
}
__device__ double fp64_div (double a, double b)
{
long long int x, y, r;
x = __double_as_longlong (a);
y = __double_as_longlong (b);
r = fp64_div_core (x, y);
return __longlong_as_double (r);
}
__global__ void fp64_div_test (double dividend, double divisor,
double *dst, int len)
{
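/* each thread performs REPS back-to-back dependent divisions forming a
serial chain; throughput comes from running many such chains concurrently */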
int stride = gridDim.x * blockDim.x;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = tid; i < len; i += stride) {
divisor = __longlong_as_double (__double_as_longlong (divisor) + tid);
#pragma unroll 1
for (int j = 0; j < REPS; j++) {
#if USE_BUILTIN_DIV
dividend /= divisor;
#else // USE_BUILTIN_DIV
dividend = fp64_div (dividend, divisor);
#endif // USE_BUILTIN_DIV
}
dst[i] = dividend;
}
}
struct fp64_div_opts {
int len;
int dev;
};
static int processArgs (int argc, char *argv[], struct fp64_div_opts *opts)
{
int error = 0;
memset (opts, 0, sizeof(*opts));
while (argc) {
if (*argv[0] == '-') {
switch (*(argv[0]+1)) {
case 'n':
opts->len = atol(argv[0]+2);
break;
case 'd':
opts->dev = atol(argv[0]+2);
break;
default:
fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
error++;
break;
}
}
argc--;
argv++;
}
return error;
}
int main (int argc, char *argv[])
{
double init_start, init_stop, start, stop, mintime, elapsed;
double *quot = 0, *res = 0;
double dividend = 3.14159265358979, divisor = 0.999999999999999;
struct fp64_div_opts opts;
int errors;
errors = processArgs (argc, argv, &opts);
if (errors) {
return EXIT_FAILURE;
}
opts.len = (opts.len) ? opts.len : FP64_DIV_DEFAULT_LEN;
opts.dev = (opts.dev) ? opts.dev : FP64_DIV_DEFAULT_DEV;
/* Trigger CUDA context creation */
init_start = second();
CUDA_SAFE_CALL (cudaFree (0));
init_stop = second();
printf ("CUDA initialization: %.3f seconds\n", init_stop - init_start);
printf ("fp64_div: testing %s FP64 division\n",
USE_BUILTIN_DIV ? "built-in" : "emulated");
/* Select GPU to run on */
struct cudaDeviceProp props;
CUDA_SAFE_CALL (cudaSetDevice (opts.dev));
CUDA_SAFE_CALL (cudaGetDeviceProperties (&props, opts.dev));
printf ("fp64_div: running on device %d (%s", opts.dev, props.name);
#if defined (_WIN32)
printf (", %s)\n", props.tccDriver ? "TCC" : "WDDM");
#else
printf (")\n");
#endif
/* Allocate memory on host and device */
res = (double *) malloc (sizeof(res[0]) * opts.len);
CUDA_SAFE_CALL (cudaMalloc((void**)&quot, sizeof(quot[0]) * opts.len));
/* Compute execution configuration */
dim3 dimBlock(THREADS_PER_BLOCK);
int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
dim3 dimGrid(threadBlocks);
mintime = fabs(log(0.0));
for (int k = 0; k < FP64_DIV_ITER; k++) {
start = second();
fp64_div_test<<<dimGrid,dimBlock>>>(dividend, divisor, quot, opts.len);
CHECK_LAUNCH_ERROR();
stop = second();
elapsed = stop - start;
if (elapsed < mintime) mintime = elapsed;
}
printf ("fp64_div: time = %.3f msec; FP64 div throughput = %.4f Gop/sec\n",
1.0e3 * mintime, (opts.len / mintime) * REPS * 1e-9);
/* check correctness */
printf ("fp64_div: checking results ...");
CUDA_SAFE_CALL (cudaMemcpy (res, quot, opts.len*sizeof(res[0]), cudaMemcpyDeviceToHost));
for (int k = 0; k < opts.len; k++) {
double quotient = dividend;
double dvsr = uint64_as_double (double_as_uint64 (divisor) + k);
for (int j = 0; j < REPS; j++) {
quotient /= dvsr;
}
if (quotient != res[k]) {
printf ("!!!@k=%d: res=%23.13a ref=%23.13a\n", k, res[k], quotient);
return EXIT_FAILURE;
}
}
printf ("\rfp64_div: test PASSED \n");
CUDA_SAFE_CALL (cudaFree(quot));
free (res);
return EXIT_SUCCESS;
}