functionally identical __device__ functions return differing values.

Hi there!

I’ve been utilizing CUDA as part of my honours project. I am focusing on optimization of an Aho-corasick implementation that uses a lot of population count operations.

In my program I have two versions of a population count that I switch between with #if defined(OPTIMIZATION_POPC). Both of these functions can be seen below:

#if defined(OPTIMIZATION_POPC)

    __device__ int
    popcount_texture(unsigned temp){ 
        // This is the fastest popcount for 32-bit types as detailed in "Software Optimization Guide for AMD64 Processors"(179-180)
        temp = temp - ((temp >> 1) & 0b01010101010101010101010101010101);
        temp = (temp & 0b00110011001100110011001100110011) + ((temp >> 2) & 0b00110011001100110011001100110011);
        return (((temp + (temp >> 4)) & 0b00001111000011110000111100001111) * 0b00000001000000010000000100000001) >> 24;
    }

    __device__ int // Count bits set in a given reduced node
    popcount_node_gpu(unsigned bitmap[], int idx){
        int count = 0;         // number of set bits
        for (int i = 0; i < DIV32(idx); ++i){
            //count += __popc(bitmap[i]); // This also doesn't work
            count += popcount_texture(bitmap[i]);
        }
        //count += __popc(bitmap[i] & ((1<<idx)-1)); // This also doesn't work
        count += popcount_texture(bitmap[idx/32] & ((1<<idx)-1));
        return count;
    }
#else
     // THESE ARE THE ONLY ONES THAT WORK? WHY?

__device__ int
    popcount_texture(unsigned temp){ 
        int i = 0,      // index
            count = 0;  // number of set bits
        do {
            if((temp & (1 << i)) != 0){
                ++count;
            }
            ++i;
        } while (i < 32);
        return count;
    }

    __device__  int // Count bits set in a given reduced node
    popcount_node_gpu(unsigned bitmap[], int idx){
        int i = 0,      // index
            count = 0;  // number of set bits
        do {
            if((bitmap[i/32] & 1 << (i % 32)) != 0){
                ++count;
            }
            ++i;
        } while (i < idx);
        return count;
    }
#endif

I am calling them from my search_trie_global kernel (below) and a few others. I have not been able to get the optimized ones working within my application at all.

The expected behaviour of the device popcount functions is that they would return total population count of given bitmaps (unsigned ints) up to a given index, instead they return 0.

I have tested all of these methods extensively on CPU and found no issues, I have also been able to get them working on small GPU test applications (such as this one -->) https://gist.github.com/AR-Calder/957f2bdcfe00a087ab57a73a13ddd33f - which is what confuses me so much. I really hope I’ve just missed out something stupid because the optimization allows for a ~99% execution time reduction on CPU (500ns → 6ns) and I’d love to see how it performs in this context.

__global__ void
search_trie_global(NodeReduced * __restrict__ trie_array, const char * __restrict__ input_text, const unsigned size_text, unsigned * out) {
    // Get ID of current thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size_text) return;

    unsigned popcount = 0;
    NodeReduced * current_node = &trie_array[0];
    unsigned ascii = input_text[tid];

    for (int i=0; tid+i < size_text; ++i){
        // Get character for comparison
        ascii = input_text[tid + i];
        if (((current_node->bitmap[ascii/32] & (1 << (ascii%32) )) != 0)){
            // Match, get next node
            popcount = popcount_node_gpu(current_node->bitmap, ascii);
            current_node = &trie_array[(current_node->offset) + popcount];
        } else {
            // No match, no point in checking any further
            break;
        }
    }
    out[tid] = (current_node->offset == NodeState::end);
}

Another usage example search_trie_texture

__global__ void
search_trie_texture(const cudaTextureObject_t texture_trie, const char * __restrict__ input_text, const int size_text, unsigned * out) {
    // Get ID of current thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= size_text) return;

    int popcount = 0;
    int node_idx = 0;
    int ascii = 0;
    int ascii_idx = 0;

    for (int i=tid; i < size_text; ++i){
        // Get character for comparison
        ascii = input_text[i];
        ascii_idx = ascii/32;

        if((tex1Dfetch<int>(texture_trie, node_idx + ascii_idx) & (1 << (ascii % 32) )) != 0){
            // Match, get next node
            for (int j=0; j < ascii_idx; ++j){
                popcount += popcount_texture(tex1Dfetch<int>(texture_trie, node_idx + j));
            }
            popcount += popcount_texture(tex1Dfetch<int>(texture_trie, node_idx + ascii_idx) & ((1<<MOD32(ascii))-1));
            node_idx = TEXTURE_WIDTH*(tex1Dfetch<int>(texture_trie, node_idx + OFFSET_IDX) + popcount);
        } else {
            // No match, no point in checking any further
            break;
        }
        popcount = 0;
    }
    out[tid] = (tex1Dfetch<int>(texture_trie, node_idx + OFFSET_IDX) == NodeState::end);
}

When I am testing these functions I reuse the same data so there shouldn’t be any difference in output.

I’ve been trying to diagnose the issue for several days now and I really don’t understand why it works in my test case but not in implementation. Can anyone offer any insight as to what on earth is going on?

I’ve tested it with cuda-memcheck which has returned 0 errors. My GPU is the primary output on my system so (afaik) I am unable to debug with cuda-gdb. Not sure if this matters but I am using NVCC 10.0 on Fedora 28.

Why are you using your own popcount implementations when CUDA provides device functions intrinsics for it that map to popcount instructions provided by GPU hardware and that definitely work correctly?

I am looking at your two implementations of popcount_node_gpu() and they do not look equivalent to me on initial perusal. Furthermore, it is not clear to me what functionality you are actually trying to implement with this function. Can you provide a complete, precise specification of what popcount_node_gpu() is supposed to do? I would have expected such a specification to precede the code in the form of a comment block. In particular, what is the argument “idx”, and what values can it take?

It would be best if you would provide a minimal complete (buildable, runnable) example that demonstrates the issues with your popcount functions, leaving out anything else (such as the search_trie_* functions).

[Later:]
It seems to me the bug in the fast version is in this line trying to handle an end-case (incomplete chunk). What values can ‘idx’ take? What shift count do you need?

__popc(bitmap[ i] & ((1<<idx)-1));

Here is some code including a minimal unit test for your perusal. I omitted all CUDA error checking, which is definitely not recommended.

#include <cstdio>
#include <cstdlib>
#include <cstdint>

// George Marsaglia's KISS PRNG
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

#define SLOW (0)
#define FAST (1)

#define BITS_PER_CHUNK (32)
#if (BITS_PER_CHUNK == 32)
#define T uint32_t
#else // BITS_PER_CHUNK
#error unsupported value of BITS_PER_CHUNK
#endif // BITS_PER_CHUNK

__device__ int popcount_node_gpu_slow (const T *bitmap, int idx)
{
    int count = 0; // number of set bits
    int i = 0; // bit index
    do {
        count += bitmap [i / BITS_PER_CHUNK] & (1 << (i % BITS_PER_CHUNK)) ? 1 : 0;
        i++;
    } while (i < idx);
    return count;
}

__device__ int popcount_node_gpu_fast (const T *bitmap, int idx)
{
    int count = 0; // number of set bits
    int i; // chunk index
    for (i = 0; i < (idx / BITS_PER_CHUNK); i++){
        count += __popc (bitmap [i]);
    }
    count += __popc (bitmap [i] & (~((~0u) << (idx % BITS_PER_CHUNK))));
    return count;
}
        
__global__ void kernel (const T *bitmap, int idx, int algo, int *bitcount)
{
    if (algo == SLOW) {
        bitcount[0] = popcount_node_gpu_slow (bitmap, idx);
    } else {
        bitcount[0] = popcount_node_gpu_fast (bitmap, idx);
    }
}

#define NBR_TESTS   (10)
#define NBR_CHUNKS  (4)
#define BITMAP_SIZE (NBR_CHUNKS * BITS_PER_CHUNK)
int main (void)
{
    T bitmap [NBR_CHUNKS], *bitmap_d = 0;
    int slowcount, fastcount, *slowcount_d = 0, *fastcount_d = 0;

    cudaMalloc ((void **)&bitmap_d, sizeof bitmap);
    cudaMalloc ((void **)&slowcount_d, sizeof slowcount_d);
    cudaMalloc ((void **)&fastcount_d, sizeof fastcount_d);
    
    for (int test = 0; test < NBR_TESTS; test++) {

        /* generate a new random bit map and copy it to the device */
        for (int i = 0; i < NBR_CHUNKS; i++) {
            bitmap [i] = KISS;
        }
        cudaMemcpy (bitmap_d, bitmap, sizeof bitmap, cudaMemcpyHostToDevice);

        /* count number of 1-bits in bitmap at any position up to the maximum */
        for (int idx = 1; idx < BITMAP_SIZE; idx++) {
            kernel<<<1,1>>>(bitmap_d, idx, SLOW, slowcount_d);
            cudaMemcpy (&slowcount, slowcount_d, sizeof slowcount, cudaMemcpyDeviceToHost);
            kernel<<<1,1>>>(bitmap_d, idx, FAST, fastcount_d);
            cudaMemcpy (&fastcount, fastcount_d, sizeof fastcount, cudaMemcpyDeviceToHost);
            if (slowcount != fastcount) {
                printf ("error:  idx=%d  slowcount=%d  fastcount=%d\n", idx, slowcount, fastcount);
                printf ("bitmap: ");
                for (int j = 0; j < NBR_CHUNKS; j++) printf ("%08x ", bitmap[j]);
                printf ("\n");
                cudaDeviceSynchronize();
                return EXIT_FAILURE;
            }
        }
    }
    cudaDeviceSynchronize();

    cudaFree (slowcount_d);
    cudaFree (fastcount_d);
    cudaFree (bitmap_d);

    return EXIT_SUCCESS;
}

The reason I was using my own is because the original implementation used their own - I wanted a direct comparison, and I also wanted to see if any famous CPU methods were faster than the intrinsic method as I was previously able to beat the gcc __builtin_popcount.

popcount_node_gpu() takes a 8-length array of unsigned ints, which in total represent 256 bits. The idx (index) value states how many of those bits should be counted. As such, idx is always in the range of 0-255, this is guaranteed because it pulls from a char->int cast. The basic version compares the bits in each element to 1 << i, which increments through all possible bit indices. The optimized version uses a bitmask ((1<<idx)-1)) to 0 any bits beyond the specified index and then uses a divide and conquer method to sum the total bits. TL;DR: it sums 256 bits up to a given index.

In writing this I just rubber-ducky’d myself and came to the same conclusion you did; ((1<<idx)-1)) is too big, it should actually be ((1<<(idx%32))-1)).

Thanks for the assist - I’ve looked over that function so many times I’ve become code-blind.

Given that your slow variant uses a do-while loop, I doubt it will work correctly when idx is 0. BTW, ‘idx’ seems like a sub-optimal name for that variable. It represents a bit count and does not function as a bit index, which the name seems to suggest.

I beg to differ. Consider idx=65. The partial chunk is going to be in bitmap[2], and the mask we need for the partial chunk is 0x00000001.

A simple unit test should have found it, though. Writing good unit tests is a valuable skill. It can save you some grief and embarrassment.

True story: I once worked on an optimized string library. Worked great, very fast. A senior colleague used it to do a demo in front of a customer. The demo blew up. My unit test had failed to check whether my code properly handled strings of length zero, which actually occur in real-life applications! My colleague wasn’t pleased and made sure to let me know.