Modified reduction with __popc(): value doubling

Hello there,

I’m building an application that simulates a forward error-correcting system. At the last stage of execution I need to count how many bits are set to “1” in a bit-stream array. Since an optimized reduction algorithm ships with the CUDA SDK I decided to use it, but I needed to tweak it a bit: not in terms of memory access, but in how the data is interpreted.

I store the bit streams in an unsigned 32-bit integer, where each of the 16 least significant bits stores the bit state of one of the 16 bit streams (the data parallelism is 16 bit streams generated per kernel). So I have a word of the form 0xXXXXQWER, where the “XXXX” bits have no meaning and “QWER” = “b_15 b_14 b_13 … b_1 b_0” (b_j is the bit of the j-th bit stream).
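For concreteness, here is a minimal sketch of how one such word could be assembled (packStreams and its argument are illustrative names, not part of my kernel):

    // Pack the current bit of each of the 16 streams into the low half-word.
    unsigned int packStreams(const unsigned char bits[16])
    {
        unsigned int word = 0;
        for (int j = 0; j < 16; ++j)
            word |= (unsigned int)(bits[j] & 1) << j;   // b_j goes to bit j
        return word;   // the upper 16 "XXXX" bits stay zero here; in general they are don't-cares
    }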

Since I need to evaluate how many bits are set to “1” across all the bit streams, I modified the reduction6 kernel to:

template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;

    T mySum = 0x00000000;

    // we reduce multiple elements per thread.  The number is determined by the
    // number of active thread blocks (via gridDim).  More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += __popc(0x0000FFFF & g_idata[i]);

        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            mySum += __popc(0x0000FFFF & g_idata[i+blockSize]);

        i += gridSize;
    }

    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32)
    {
        // now that we are using warp-synchronous programming (below)
        // we need to declare our shared memory volatile so that the compiler
        // doesn't reorder stores to it and induce incorrect behavior.
        volatile T *smem = sdata;
        if (blockSize >=  64) { smem[tid] = mySum = mySum + smem[tid + 32]; EMUSYNC; }
        if (blockSize >=  32) { smem[tid] = mySum = mySum + smem[tid + 16]; EMUSYNC; }
        if (blockSize >=  16) { smem[tid] = mySum = mySum + smem[tid +  8]; EMUSYNC; }
        if (blockSize >=   8) { smem[tid] = mySum = mySum + smem[tid +  4]; EMUSYNC; }
        if (blockSize >=   4) { smem[tid] = mySum = mySum + smem[tid +  2]; EMUSYNC; }
        if (blockSize >=   2) { smem[tid] = mySum = mySum + smem[tid +  1]; EMUSYNC; }
    }

    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] += sdata[0];
}


Each time a word is loaded from global memory I ensure that the “XXXX” bits (the 16 MSBs) of the 32-bit integer do not influence the result.

Hence, the loaded word is AND’d with the bitmask 0x0000FFFF:

 (0x0000FFFF & g_idata[/*some index shown in the code above*/])

Furthermore, since I’m interested in the number of bits set to “1”, I pass the masked word to the built-in function __popc(), which returns precisely the number of bits set to “1” in an integer word.

However, debugging my algorithm with an array *array = {0x0000FFFB for the first element, 0x00000001 for the second element, 0x00000000 otherwise}, the final reduction step reports 22 set bits.

That cannot be right: 0x0000FFFB (65531) has 15 bits set to “1” (every bit of the low half-word except the 3rd LSB), and 0x00000001 has a single set bit. So the result should be 16, not 22.
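For reference, a trivial host-side check (assuming a compiler that provides __builtin_popcount) confirms the expected value of 16:

    // Reference bit count over the masked low half-words of the input array.
    unsigned int referenceCount(const unsigned int *data, unsigned int n)
    {
        unsigned int total = 0;
        for (unsigned int i = 0; i < n; ++i)
            total += __builtin_popcount(data[i] & 0x0000FFFF);
        return total;   // {0x0000FFFB, 0x00000001, 0, ...} gives 15 + 1 = 16
    }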

Can anyone spot what I might be doing wrong?! For the record, I’m using 32 blocks of 128 threads to reduce a bit-stream array of length 64800; since the next power of two (65536) is very close, I pad the input to a 65536-element array.
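In case it matters, the launch looks roughly like this (a sketch of my configuration; d_idata and d_odata are the device input and per-block output buffers):

    // 32 blocks of 128 threads; input padded from 64800 to 65536 words,
    // which is a power of two, hence nIsPow2 = true.
    const unsigned int n       = 65536;
    const unsigned int threads = 128;
    const unsigned int blocks  = 32;
    const unsigned int smem    = threads * sizeof(unsigned int);

    reduce6<unsigned int, 128, true><<<blocks, threads, smem>>>(d_idata, d_odata, n);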

Thank you.

Is this on a Fermi card?

This is a Fermi card, a Tesla C2050 with ECC turned off.

I would be having a close look at the PTX of the warp-level reduction phase of that code.

On Fermi, you need to be careful with implicit warp-level synchronization and shared memory. Fermi doesn’t have instructions that operate directly on shared memory; values must be loaded into registers and stored back. Compiler optimization can leave things in registers which would otherwise get written back to shared memory. That is what the comment and the volatile pointer are for in the warp-level code block: declaring things volatile forces a write back to memory. But now you have introduced both a volatile shared memory access and a register variable into each assignment, and I am not sure what the compiler will do in that case. Optimization might break what should otherwise work.
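One way to remove that ambiguity would be to split the chained assignment so the register update and the volatile store are separate statements. A sketch of a drop-in replacement for the warp-level block in the kernel above (semantics intended to match the original):

    // Same warp-level step, with the register add and the volatile store
    // written as two statements instead of one chained assignment.
    volatile T *smem = sdata;
    if (blockSize >= 64) { mySum += smem[tid + 32]; smem[tid] = mySum; EMUSYNC; }
    if (blockSize >= 32) { mySum += smem[tid + 16]; smem[tid] = mySum; EMUSYNC; }
    // ...and so on down to offset 1, as in the original kernel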

Problem solved:

The final global memory store should be:

g_odata[blockIdx.x] = sdata[0];

and not

g_odata[blockIdx.x] += sdata[0];

Mea culpa, that instruction is a legacy from a previous version of my application.
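For future readers: with the store fixed, each launch leaves one fresh partial count per block in g_odata, and the 32 partials can simply be summed on the host. A minimal sketch (d_odata and the block count as in my setup above):

    // Copy the 32 per-block partial counts back and finish the sum on the CPU.
    // With "=" the buffer holds fresh partials each launch; with "+=" it would
    // keep accumulating whatever was already there, hence the inflated count.
    unsigned int h_partials[32];
    cudaMemcpy(h_partials, d_odata, 32 * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    unsigned int totalSetBits = 0;
    for (int b = 0; b < 32; ++b)
        totalSetBits += h_partials[b];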

Either way, thanks avidday, you should be awarded either a fast-reply award or a restraining order for stalking all [nearly all] my topics.

Occam’s Razor strikes again. I didn’t notice the summation at the end either…