A fun weekend diversion: BCD addition on the GPU

njuffa · March 25, 2024, 7:29am

BCD (binary-code decimal) is a method of performing decimal arithmetic by storing each decimal digit in a tetrade of bits, colloquially referred to as a nibble. It used to be quite common in commercial data processing and I even recall using BCD-based floating-point arithmetic in the early 1980s.

Some 25 years ago I wrote some code for BCD integer addition to demonstrate conclusively that it was safe to remove BCD support instructions from a particular instruction set architecture. While it was pretty obvious, it is always good to have some concrete demonstration at hand. This weekend I was reminded of this by a thread on Stackoverflow, so I figured I should take a look to see how it would pan out on the GPU.

Constructing BCD addition on top of regular binary addition is conceptually fairly trivial. When summing two source nibbles one temporarily adds 6 to force carry-propagation between nibbles. At the end one subtracts 6 from all nibbles that did not produce a carry-out to the next higher nibble.

My approach in 1999 was to detect the carry propagation by the effect it has on the LSB of the next higher nibble. It is trivial to determine the value of the LSB after addition as the Exclusive-OR of the LSB of the two source operands. If the LSB of a nibble has a different value after addition, a carry-in from the nibble below it must have occurred. This works straightforward and quite efficiently with the minor issue of capturing the carry-out from the most significant nibble, which is supposed to propagate to the next higher 8-digit chunk.

When Knuth published volume 4a of The Art of Computer Programming in 2011, I noticed that he proposed a different method of nibble carry detection, by considering the MSB of each nibble of both sources as well as the sum. It turns out the functional relationship maps to the majority-of-3 function (which Knuth prefers to call the median), and as I have demonstrated previously in this forum, modern GPUs are able to map this operation to a single LOP3 instruction which can implement any logical function of three inputs. Using the majority-of-3 operation also makes for very concise and elegant code.

Since GPUs are 32-bit machines, it is best to process BCD operands in chunks of eight digits each of which fills a 32-bit register. With the algorithm from TAOCP the basic building blocks handling those eight digits only require 9 to 11 instructions, so overall we need just a bit more than one instruction per digit. Which is pretty efficient but of course also significantly slower than handling binary arithmetic. Here I am showing sample machine code generated by CUDA 12.3 for an sm_75 target:

BCD32_ADDcc(unsigned int, unsigned int, unsigned int*):
 IADD3 R3, R4, 0x66666666, R5 
 IADD3 R5, -R5, -0x66666667, RZ 
 LOP3.LUT R5, R3, R5, R4, 0xd4, !PT 
 LOP3.LUT R9, RZ, R5, RZ, 0x33, !PT 
 LOP3.LUT R4, R5, 0x88888888, RZ, 0xc0, !PT 
 SHF.R.U32.HI R9, RZ, 0x1f, R9 
 IMAD.IADD R3, R3, 0x1, -R4 
 ST.E.SYS [R6], R9 
 LEA.HI R4, R4, R3, RZ, 0x1e 
 RET.ABS.NODEC R20 0x0

Here is my little demo program:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define USE_KNUTH (1)

/* majority-of-3 operation that Knuth refers to more generically as median */
__device__ uint32_t median32 (uint32_t x, uint32_t y, uint32_t z)
{
    return (x | (y & z)) & (y | z);
}

/* 8-digit BCD addition with carry-out */
__device__ uint32_t BCD32_ADDcc (uint32_t x, uint32_t y, uint32_t *cy)
{
#if USE_KNUTH
    // Knuth, TAOCP 4a (2011), answer to exercise 100 of section 7.1.3
    uint32_t z, u, t;
    z = y + 0x66666666;
    u = x + z;
    t = median32 (~x, ~z, u) & 0x88888888;
    *cy = (~t) >> 31;
    return (u - t) + (t >> 2);
#else // ! USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t s, u, carry_out;
    x = x + sixes;          // add "nibble offset"; cannot overflow
    s = x ^ y;              // preliminary sum bits
    u = x + y;
    carry_out = u < x;      // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = ~s & eights;        // isolate nibble lsbs without carry-in
    *cy = carry_out;        // record carry-out from most significant nibble
    u = (u - s) + (s >> 2); // subtract 6 from nibbles withtout carry-out
    return u;
#endif // USE_KNUTH
}

/* 8-digit BCD addition with carry-in */
__device__ uint32_t BCD32_ADDC (uint32_t x, uint32_t y, uint32_t cy)
{
#if USE_KNUTH
    // Knuth, TAOCP 4a (2011), answer to exercise 100 of section 7.1.3
    uint32_t z, u, t;
    x = x + cy;
    z = y + 0x66666666;
    u = x + z;
    t = median32 (~x, ~z, u) & 0x88888888;
    return (u - t) + (t >> 2);
#else // !USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t s, u, carry_out;
    y = y + cy;             // incorporate carry-in; cannot overflow
    x = x + sixes;          // add "nibble offset"; cannot overflow
    s = x ^ y;              // preliminary sum bits
    u = x + y;              
    carry_out = u < x;      // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = ~s & eights;        // isolate nibble lsbs without carry-in
    u = (u - s) + (s >> 2); // subtract 6 from nibbles withtout carry-out
    return u;
#endif // USE_KNUTH
}

/* 8-digit BCD addition with carry-in and carry-out */
__device__ uint32_t BCD32_ADDCcc (uint32_t x, uint32_t y, uint32_t *cy)
{
#if USE_KNUTH
    // Knuth, TAOCP 4a (2011), answer to exercise 100 of section 7.1.3
    uint32_t z, u, t;
    x = x + (*cy);
    z = y + 0x66666666;
    u = x + z;
    t = median32 (~x, ~z, u) & 0x88888888;
    *cy = (~t) >> 31;
    return (u - t) + (t >> 2);
#else  // !USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t s, u, carry_out;
    y = y + (*cy);          // incorporate carry-in; cannot overflow
    x = x + sixes;          // add "nibble offset"; cannot overflow
    s = x ^ y;              // preliminary sum bits
    u = x + y;              
    carry_out = u < x;      // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = ~s & eights;        // isolate nibble lsbs without carry-in
    *cy = carry_out;        // record carry-out from most significant nibble
    u = (u - s) + (s >> 2); // subtract 6 from nibbles withtout carry-out
    return u;
#endif // USE_KNUTH
}

/* 24-digit BCD addition */
__device__ void bcd_add (const uint32_t *a, const uint32_t *b, uint32_t *r)
{
    uint32_t cy;
    r[0] = BCD32_ADDcc  (a[0], b[0], &cy);
    r[1] = BCD32_ADDCcc (a[1], b[1], &cy);
    r[2] = BCD32_ADDC   (a[2], b[2],  cy);
}


__global__ void kernel (const uint32_t *a, const uint32_t *b, uint32_t *r)
{
    bcd_add (a, b, r);
}

int main (void)
{
    const uint32_t a[3] = {0x76543210, 0x54321098, 0x32109876 };
    const uint32_t b[3] = {0x98765432, 0x76543210, 0x54321098 };
    uint32_t r[3];
    uint32_t *ad = 0, *bd =0, *rd=0;

    cudaMalloc ((void**)&ad, sizeof a);
    cudaMalloc ((void**)&bd, sizeof b);
    cudaMalloc ((void**)&rd, sizeof r);
    cudaMemcpy (ad, a, sizeof a, cudaMemcpyHostToDevice);
    cudaMemcpy (bd, b, sizeof b, cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(ad, bd, rd);
    cudaMemcpy (r, rd, sizeof r, cudaMemcpyDeviceToHost);
    printf ("a   = %08x_%08x_%08x\n", a[2], a[1], a[0]);
    printf ("b   = %08x_%08x_%08x\n", b[2], b[1], b[0]);
    printf ("sum = %08x_%08x_%08x\n", r[2], r[1], r[0]);
    printf ("ref = 86430975_30864309_75308642 (courtesy of Wolfram Alpha)\n");
    cudaFree (ad);
    cudaFree (bd);
    cudaFree (rd);

    return EXIT_SUCCESS;
}

Topic		Replies	Views
Big Integer Arithmetic Anyone trying to do bign ints on CUDA? CUDA Programming and Performance	2	9820	February 19, 2008
BigInteger CUDA Programming and Performance	10	5835	July 15, 2011
Large Integers on CUDA CUDA Programming and Performance	2	8914	June 25, 2010
Simple vector addition n Cuda CUDA Programming and Performance	4	1810	May 19, 2011
Integer carry chains CUDA Programming and Performance	5	8702	November 18, 2010
throughput of integer add CUDA Programming and Performance	17	3195	August 15, 2011
Writing a function in PTX? Need to hand-code a function in PTX CUDA Programming and Performance	3	3323	September 10, 2008
nvopencc backend with 'addc' intrinsics new builtins for 32-bit CUDA compiler under Linux CUDA Programming and Performance	2	1783	November 9, 2008
Prevent ncc from applying MADD optimization precision, IEEE 754 CUDA Programming and Performance	11	11071	April 6, 2009
Carry generation Carry generation in OpenCL CUDA Programming and Performance	0	4305	July 29, 2010

A fun weekend diversion: BCD addition on the GPU

Related topics