A fun weekend diversion: BCD addition on the GPU

BCD (binary-code decimal) is a method of performing decimal arithmetic by storing each decimal digit in a tetrade of bits, colloquially referred to as a nibble. It used to be quite common in commercial data processing and I even recall using BCD-based floating-point arithmetic in the early 1980s.

Some 25 years ago I wrote some code for BCD integer addition to demonstrate conclusively that it was safe to remove BCD support instructions from a particular instruction set architecture. While it was pretty obvious, it is always good to have some concrete demonstration at hand. This weekend I was reminded of this by a thread on Stackoverflow, so I figured I should take a look to see how it would pan out on the GPU.

Constructing BCD addition on top of regular binary addition is conceptually fairly trivial. When summing two source nibbles one temporarily adds 6 to force carry-propagation between nibbles. At the end one subtracts 6 from all nibbles that did not produce a carry-out to the next higher nibble.

My approach in 1999 was to detect the carry propagation by the effect it has on the LSB of the next higher nibble. It is trivial to determine the value of the LSB after addition as the Exclusive-OR of the LSB of the two source operands. If the LSB of a nibble has a different value after addition, a carry-in from the nibble below it must have occurred. This works straightforward and quite efficiently with the minor issue of capturing the carry-out from the most significant nibble, which is supposed to propagate to the next higher 8-digit chunk.

When Knuth published volume 4a of The Art of Computer Programming in 2011, I noticed that he proposed a different method of nibble carry detection, by considering the MSB of each nibble of both sources as well as the sum. It turns out the functional relationship maps to the majority-of-3 function (which Knuth prefers to call the median), and as I have demonstrated previously in this forum, modern GPUs are able to map this operation to a single LOP3 instruction which can implement any logical function of three inputs. Using the majority-of-3 operation also makes for very concise and elegant code.

Since GPUs are 32-bit machines, it is best to process BCD operands in chunks of eight digits each of which fills a 32-bit register. With the algorithm from TAOCP the basic building blocks handling those eight digits only require 9 to 11 instructions, so overall we need just a bit more than one instruction per digit. Which is pretty efficient but of course also significantly slower than handling binary arithmetic. Here I am showing sample machine code generated by CUDA 12.3 for an sm_75 target:

BCD32_ADDcc(unsigned int, unsigned int, unsigned int*):
 IADD3 R3, R4, 0x66666666, R5 
 IADD3 R5, -R5, -0x66666667, RZ 
 LOP3.LUT R5, R3, R5, R4, 0xd4, !PT 
 LOP3.LUT R9, RZ, R5, RZ, 0x33, !PT 
 LOP3.LUT R4, R5, 0x88888888, RZ, 0xc0, !PT 
 SHF.R.U32.HI R9, RZ, 0x1f, R9 
 IMAD.IADD R3, R3, 0x1, -R4 
 ST.E.SYS [R6], R9 
 LEA.HI R4, R4, R3, RZ, 0x1e 
 RET.ABS.NODEC R20 0x0 

Here is my little demo program:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define USE_KNUTH (1)

/* majority-of-3 operation that Knuth refers to more generically as median */
__device__ uint32_t median32 (uint32_t x, uint32_t y, uint32_t z)
{
    return (x | (y & z)) & (y | z);
}

/* 8-digit BCD addition with carry-out */
__device__ uint32_t BCD32_ADDcc (uint32_t x, uint32_t y, uint32_t *cy)
{
#if USE_KNUTH
    // Knuth, TAOCP 4a (2011), answer to exercise 100 of section 7.1.3
    uint32_t z, u, t;
    z = y + 0x66666666;
    u = x + z;
    t = median32 (~x, ~z, u) & 0x88888888;
    *cy = (~t) >> 31;
    return (u - t) + (t >> 2);
#else // ! USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t s, u, carry_out;
    x = x + sixes;          // add "nibble offset"; cannot overflow
    s = x ^ y;              // preliminary sum bits
    u = x + y;
    carry_out = u < x;      // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = ~s & eights;        // isolate nibble lsbs without carry-in
    *cy = carry_out;        // record carry-out from most significant nibble
    u = (u - s) + (s >> 2); // subtract 6 from nibbles withtout carry-out
    return u;
#endif // USE_KNUTH
}

/* 8-digit BCD addition with carry-in */
__device__ uint32_t BCD32_ADDC (uint32_t x, uint32_t y, uint32_t cy)
{
#if USE_KNUTH
    // Knuth, TAOCP 4a (2011), answer to exercise 100 of section 7.1.3
    uint32_t z, u, t;
    x = x + cy;
    z = y + 0x66666666;
    u = x + z;
    t = median32 (~x, ~z, u) & 0x88888888;
    return (u - t) + (t >> 2);
#else // !USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t s, u, carry_out;
    y = y + cy;             // incorporate carry-in; cannot overflow
    x = x + sixes;          // add "nibble offset"; cannot overflow
    s = x ^ y;              // preliminary sum bits
    u = x + y;              
    carry_out = u < x;      // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = ~s & eights;        // isolate nibble lsbs without carry-in
    u = (u - s) + (s >> 2); // subtract 6 from nibbles withtout carry-out
    return u;
#endif // USE_KNUTH
}

/* 8-digit BCD addition with carry-in and carry-out */
__device__ uint32_t BCD32_ADDCcc (uint32_t x, uint32_t y, uint32_t *cy)
{
#if USE_KNUTH
    // Knuth, TAOCP 4a (2011), answer to exercise 100 of section 7.1.3
    uint32_t z, u, t;
    x = x + (*cy);
    z = y + 0x66666666;
    u = x + z;
    t = median32 (~x, ~z, u) & 0x88888888;
    *cy = (~t) >> 31;
    return (u - t) + (t >> 2);
#else  // !USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t s, u, carry_out;
    y = y + (*cy);          // incorporate carry-in; cannot overflow
    x = x + sixes;          // add "nibble offset"; cannot overflow
    s = x ^ y;              // preliminary sum bits
    u = x + y;              
    carry_out = u < x;      // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = ~s & eights;        // isolate nibble lsbs without carry-in
    *cy = carry_out;        // record carry-out from most significant nibble
    u = (u - s) + (s >> 2); // subtract 6 from nibbles withtout carry-out
    return u;
#endif // USE_KNUTH
}

/* 24-digit BCD addition */
__device__ void bcd_add (const uint32_t *a, const uint32_t *b, uint32_t *r)
{
    uint32_t cy;
    r[0] = BCD32_ADDcc  (a[0], b[0], &cy);
    r[1] = BCD32_ADDCcc (a[1], b[1], &cy);
    r[2] = BCD32_ADDC   (a[2], b[2],  cy);
}


__global__ void kernel (const uint32_t *a, const uint32_t *b, uint32_t *r)
{
    bcd_add (a, b, r);
}

int main (void)
{
    const uint32_t a[3] = {0x76543210, 0x54321098, 0x32109876 };
    const uint32_t b[3] = {0x98765432, 0x76543210, 0x54321098 };
    uint32_t r[3];
    uint32_t *ad = 0, *bd =0, *rd=0;

    cudaMalloc ((void**)&ad, sizeof a);
    cudaMalloc ((void**)&bd, sizeof b);
    cudaMalloc ((void**)&rd, sizeof r);
    cudaMemcpy (ad, a, sizeof a, cudaMemcpyHostToDevice);
    cudaMemcpy (bd, b, sizeof b, cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(ad, bd, rd);
    cudaMemcpy (r, rd, sizeof r, cudaMemcpyDeviceToHost);
    printf ("a   = %08x_%08x_%08x\n", a[2], a[1], a[0]);
    printf ("b   = %08x_%08x_%08x\n", b[2], b[1], b[0]);
    printf ("sum = %08x_%08x_%08x\n", r[2], r[1], r[0]);
    printf ("ref = 86430975_30864309_75308642 (courtesy of Wolfram Alpha)\n");
    cudaFree (ad);
    cudaFree (bd);
    cudaFree (rd);

    return EXIT_SUCCESS;
}
2 Likes