Hello there,
I’m building an application that simulates a forward error correction system. In the last stage of execution I need to count how many bits are set to “1” in a bit-stream array. Since an optimized reduction algorithm ships with the CUDA SDK, I decided to use it, but I needed to tweak it a bit: not the memory access pattern, but how the data is interpreted.
I store the bit streams in unsigned 32-bit integers, where each of the 16 least significant bits holds the current bit of one of the 16 bit streams (the data parallelism is 16 bit streams generated per kernel). So each word has the form 0xXXXXQWER, where the X digits carry no meaning and “QWER” encodes “b_15 b_14 b_13 … b_1 b_0” (b_j is the bit of the j-th bit stream).
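To make the layout concrete, here is a minimal host-side sketch of how one such word could be packed (pack_streams is an illustrative helper of mine, not part of the actual code):

unsigned int pack_streams(const int bit[16])   // bit[j] holds b_j of stream j (0 or 1)
{
    unsigned int word = 0;
    for (int j = 0; j < 16; ++j)
        word |= (unsigned int)(bit[j] & 1) << j;   // b_j goes to bit position j
    return word;   // the 16 MSBs (the "XXXX" digits) are simply left at 0 here
}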
Since I need to count how many bits are set to “1” across all the bit streams, I modified the reduce6 kernel as follows:
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    T mySum = 0x00000000;

    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += __popc(0x0000FFFF & g_idata[i]);

        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            mySum += __popc(0x0000FFFF & g_idata[i+blockSize]);

        i += gridSize;
    }

    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }

#ifndef __DEVICE_EMULATION__
    if (tid < 32)
#endif
    {
        // now that we are using warp-synchronous programming (below)
        // we need to declare our shared memory volatile so that the compiler
        // doesn't reorder stores to it and induce incorrect behavior.
        volatile T *smem = sdata;
        if (blockSize >= 64) { smem[tid] = mySum = mySum + smem[tid + 32]; EMUSYNC; }
        if (blockSize >= 32) { smem[tid] = mySum = mySum + smem[tid + 16]; EMUSYNC; }
        if (blockSize >= 16) { smem[tid] = mySum = mySum + smem[tid +  8]; EMUSYNC; }
        if (blockSize >=  8) { smem[tid] = mySum = mySum + smem[tid +  4]; EMUSYNC; }
        if (blockSize >=  4) { smem[tid] = mySum = mySum + smem[tid +  2]; EMUSYNC; }
        if (blockSize >=  2) { smem[tid] = mySum = mySum + smem[tid +  1]; EMUSYNC; }
    }

    // write result for this block to global mem
    if (tid == 0)
        g_odata[blockIdx.x] += sdata[0];
}
Each time a word is loaded from global memory, I make sure the “XXXX” bits (the 16 MSBs) of the 32-bit integer cannot influence the result.
To that end, the loaded word is ANDed with the bitmask 0x0000FFFF:
(0x0000FFFF & g_idata[/*some index shown in the code above*/])
Furthermore, since I’m interested in the number of bits set to “1”, I pass the masked word to the built-in function __popc(), which returns exactly the number of bits set to “1” in an integer word.
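To sanity-check that combination on the host, I use a plain reference popcount (a sketch; popcount_ref is just an illustrative helper, since __popc is a device-only intrinsic):

unsigned int popcount_ref(unsigned int x)   // naive host-side bit count
{
    unsigned int c = 0;
    for (; x != 0; x >>= 1)
        c += x & 1u;
    return c;
}
// popcount_ref(0x0000FFFF & 0x0000FFFB) == 15, the same value __popc produces on the device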
However, debugging my algorithm with an array *array = {0x0000FFFB for the first element, 0x00000001 for the second element, 0x00000000 otherwise} outputs 22 set bits in the final reduction step.
That cannot be right: 0x0000FFFB is 65531 and has 15 bits set to “1” (all of the lower 16 bits are set except the 3rd LSB), and 0x00000001 has a single bit set to “1”. So the result should be 16, not 22.
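For comparison, counting on the CPU over the same zero-padded test array (reusing the popcount_ref helper sketched above) gives 16:

unsigned int expected = 0;
for (unsigned int k = 0; k < 65536; ++k)   // test array, zero-padded to 65536 words
    expected += popcount_ref(0x0000FFFF & array[k]);
// array[0] -> 15 set bits, array[1] -> 1 set bit, all others -> 0, so expected == 16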
Can anyone spot what I might be doing wrong? For the record, I’m using 32 blocks of 128 threads to reduce a bit stream of length 64800; since the nearest power of two (65536) is very close, I’m using a 65536-element array.
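For reference, with that configuration the launch on my side looks roughly like this (a sketch; d_idata/d_odata are my device buffers, and nIsPow2 is true because 65536 is a power of two):

const unsigned int n = 65536;
const unsigned int threads = 128, blocks = 32;
const unsigned int smemSize = threads * sizeof(unsigned int);   // one shared-memory slot per thread
reduce6<unsigned int, 128, true><<<blocks, threads, smemSize>>>(d_idata, d_odata, n);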
Thank you.