Modified reduction with __popc(): value doubling

Hello there,

I’m building an application that simulates a forward error-correcting system. At the last stage of execution I need to count how many bits are set to “1” in a bit-stream array. Since an optimized reduction algorithm ships with the CUDA SDK I decided to use it, but I needed to tweak it a bit: not in terms of memory access, but in how the data is interpreted.

I store the bit streams in an unsigned 32-bit integer, where each of the 16 least significant bits stores the bit state of one of the 16 bit streams (the data parallelism is 16 bit streams generated per kernel). So I have a word of the form 0xXXXXQWER, where the “XXXX” bits have no meaning and “QWER” = “b_15 b_14 b_13 … b_1 b_0” (b_j is the bit of the j-th bit stream).
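For concreteness, here is a minimal sketch of how one such word could be assembled (packStreams and its argument are illustrative names, not part of my kernel):

    // Pack the current bit of each of the 16 streams into the low half-word.
    unsigned int packStreams(const unsigned char bits[16])
    {
        unsigned int word = 0;
        for (int j = 0; j < 16; ++j)
            word |= (unsigned int)(bits[j] & 1) << j;   // b_j goes to bit j
        return word;   // the upper 16 "XXXX" bits stay zero here; in general they are don't-cares
    }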

Since I need to evaluate how many bits are set to “1” across all the bit streams, I modified the reduction6 kernel to:

template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;

    T mySum = 0x00000000;

    // we reduce multiple elements per thread.  The number is determined by the
    // number of active thread blocks (via gridDim).  More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += __popc(0x0000FFFF & g_idata[i]);

        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            mySum += __popc(0x0000FFFF & g_idata[i+blockSize]);

        i += gridSize;
    }

    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32)
    {
        // now that we are using warp-synchronous programming (below)
        // we need to declare our shared memory volatile so that the compiler
        // doesn't reorder stores to it and induce incorrect behavior.
        volatile T *smem = sdata;
        if (blockSize >=  64) { smem[tid] = mySum = mySum + smem[tid + 32]; EMUSYNC; }
        if (blockSize >=  32) { smem[tid] = mySum = mySum + smem[tid + 16]; EMUSYNC; }
        if (blockSize >=  16) { smem[tid] = mySum = mySum + smem[tid +  8]; EMUSYNC; }
        if (blockSize >=   8) { smem[tid] = mySum = mySum + smem[tid +  4]; EMUSYNC; }
        if (blockSize >=   4) { smem[tid] = mySum = mySum + smem[tid +  2]; EMUSYNC; }
        if (blockSize >=   2) { smem[tid] = mySum = mySum + smem[tid +  1]; EMUSYNC; }
    }

    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] += sdata[0];
}


Each time a word is loaded from global memory I ensure that the “XXXX” bits (the 16 MSBs) of the 32-bit integer do not influence the result.

Hence, the loaded word is AND’d with the bitmask 0x0000FFFF:

 (0x0000FFFF & g_idata[/*some index shown in the code above*/])

Furthermore, since I’m interested in the number of bits set to “1”, I pass the masked word to the built-in function __popc(), which returns precisely the number of bits set to “1” in an integer word.

However, debugging my algorithm with an array *array = {0x0000FFFB for the first element, 0x00000001 for the second element, 0x00000000 otherwise}, the final reduction step reports 22 set bits.

That cannot be right: 0x0000FFFB (65531) has 15 bits set to “1” (every bit of the low half-word except the 3rd LSB), and 0x00000001 has a single set bit. So the result should be 16, not 22.
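For reference, a trivial host-side check (assuming a compiler that provides __builtin_popcount) confirms the expected value of 16:

    // Reference bit count over the masked low half-words of the input array.
    unsigned int referenceCount(const unsigned int *data, unsigned int n)
    {
        unsigned int total = 0;
        for (unsigned int i = 0; i < n; ++i)
            total += __builtin_popcount(data[i] & 0x0000FFFF);
        return total;   // {0x0000FFFB, 0x00000001, 0, ...} gives 15 + 1 = 16
    }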

Can anyone spot what I might be doing wrong?! For the record, I’m using 32 blocks of 128 threads to reduce a bit-stream array of length 64800; since the next power of two (65536) is very close, I pad the input to a 65536-element array.
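In case it matters, the launch looks roughly like this (a sketch of my configuration; d_idata and d_odata are the device input and per-block output buffers):

    // 32 blocks of 128 threads; input padded from 64800 to 65536 words,
    // which is a power of two, hence nIsPow2 = true.
    const unsigned int n       = 65536;
    const unsigned int threads = 128;
    const unsigned int blocks  = 32;
    const unsigned int smem    = threads * sizeof(unsigned int);

    reduce6<unsigned int, 128, true><<<blocks, threads, smem>>>(d_idata, d_odata, n);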

Thank you.

Is this on a Fermi card?

This is a Fermi card, a Tesla C2050 with ECC turned off.

I would be having a close look at the PTX of the warp-level reduction phase of that code.

On Fermi, you need to be careful with implicit warp-level synchronization and shared memory. Fermi doesn’t have instructions that operate directly on shared memory; values must be loaded into registers and stored back. Compiler optimization can leave things in registers which would otherwise get written back to shared memory. That is what the comment and the volatile pointer are for in the warp-level code block: declaring things volatile forces a write back to memory. But now you have introduced both a volatile shared memory access and a register variable into each assignment, and I am not sure what the compiler will do in that case. Optimization might break what should otherwise work.
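One way to remove that ambiguity would be to split the chained assignment so the register update and the volatile store are separate statements. A sketch of a drop-in replacement for the warp-level block in the kernel above (semantics intended to match the original):

    // Same warp-level step, with the register add and the volatile store
    // written as two statements instead of one chained assignment.
    volatile T *smem = sdata;
    if (blockSize >= 64) { mySum += smem[tid + 32]; smem[tid] = mySum; EMUSYNC; }
    if (blockSize >= 32) { mySum += smem[tid + 16]; smem[tid] = mySum; EMUSYNC; }
    // ...and so on down to offset 1, as in the original kernel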

Problem solved:

The final global memory store should be:

g_odata[blockIdx.x] = sdata[0];

and not

g_odata[blockIdx.x] += sdata[0];

Mea culpa, that instruction is a legacy from a previous version of my application.
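For future readers: with the store fixed, each launch leaves one fresh partial count per block in g_odata, and the 32 partials can simply be summed on the host. A minimal sketch (d_odata and the block count as in my setup above):

    // Copy the 32 per-block partial counts back and finish the sum on the CPU.
    // With "=" the buffer holds fresh partials each launch; with "+=" it would
    // keep accumulating whatever was already there, hence the inflated count.
    unsigned int h_partials[32];
    cudaMemcpy(h_partials, d_odata, 32 * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    unsigned int totalSetBits = 0;
    for (int b = 0; b < 32; ++b)
        totalSetBits += h_partials[b];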

Either way, thanks avidday, you should be awarded either a fast-reply award or a restraining order for stalking all [nearly all] my topics.

Occam’s Razor strikes again. I didn’t notice the summation at the end either…