Replacing float with double: simple reduction

I’m trying to replace floats with doubles in the following code from http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf:

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockSize * 2) + tid;
unsigned int gridSize = blockSize * 2 * gridDim.x;
sdata[tid] = 0;
while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
__syncthreads();
if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

It only works correctly with a block size of 2 threads. I know it has to do with doubles taking twice as much space as floats, but where in the algorithm would you have to halve something / remove a tree branch?

Any help is appreciated :D

This makes no sense. You are using int all over the place (e.g. for input parameters, shared memory). Where exactly did you use double?

Sorry, I copied the example.

Here’s what I have:

MAIN:

int threads = 128;
dim3 dimBlock(threads);
dim3 dimGrid((total_size)/dimBlock.x);
int smemSize = threads * sizeof(double);

reduce6<<<dimGrid, dimBlock, smemSize>>>(d_A.elements, d_result, total_size);

KERNEL:

template <unsigned int blockSize>
__global__ void reduce6(double *g_idata, double *g_odata, unsigned int n){
extern __shared__ double sdata[];

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockSize*2) + tid;
unsigned int gridSize = blockSize * 2 * gridDim.x;
sdata[tid] = 0;

while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+ blockSize]; i += gridSize; }
__syncthreads();

if (blockSize >= 512){ if (tid < 256) {sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256){ if (tid < 128) {sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128){ if (tid < 64) {sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

if (tid < 32){
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

I also tried the first kernel from the whitepaper (reduce0 below), which seems simplest. Trying to reverse-engineer the mistake: it seems to work fine for certain block and grid dimensions, but not universally, so it’s hard to tell whether the problem is in the kernel itself or in the memory/launch parameters of the kernel call.

__global__ void reduce0(double *g_idata, double *g_odata) {
extern __shared__ double sdata[];

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();


for (unsigned int s=1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
/*
for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
*/
if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

I suggest providing complete code if you want help. Your reduce6 example could not possibly be correct: your kernel is templated, but your kernel invocation provides no template argument, so the compiler should throw an error there.
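For what it’s worth, the whitepaper you linked handles this with a host-side switch, so the compile-time template argument matches the runtime thread count. A rough sketch of that pattern applied to your launch (it reuses the threads/dimGrid/dimBlock/smemSize variables from your MAIN snippet, and the set of cases shown is illustrative):

// Hypothetical dispatch following the switch pattern from reduction.pdf:
// the template argument must be a compile-time constant, so branch on the
// runtime thread count and instantiate the kernel once per supported block size.
switch (threads)
{
    case 512: reduce6<512><<<dimGrid, dimBlock, smemSize>>>(d_A.elements, d_result, total_size); break;
    case 256: reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_A.elements, d_result, total_size); break;
    case 128: reduce6<128><<<dimGrid, dimBlock, smemSize>>>(d_A.elements, d_result, total_size); break;
    case  64: reduce6<64><<<dimGrid, dimBlock, smemSize>>>(d_A.elements, d_result, total_size); break;
    // ... and so on down to case 1, as in the whitepaper
}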

These kernels do expect the block size to be a power of 2, and your grid dimensioning code also expects the total size to be evenly divisible by the block size.

Hint: zero-padding the input data works for your reduction problem when the problem size is not a power of two. Then make sure to pass the padded, power-of-two problem size to the code.
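A rough host-side sketch of that padding idea (the padded-size computation and the h_data/d_in names are hypothetical, not from your code; it assumes threads == 128 as in your MAIN and padded >= 2*threads):

// Hypothetical padding sketch: round total_size up to the next power of two,
// zero-fill the tail, and launch with the padded size. Zeros are the identity
// for a sum, so they don't change the result.
unsigned int padded = 1;
while (padded < total_size) padded <<= 1;                // next power of two >= total_size

double *d_in;
cudaMalloc(&d_in, padded * sizeof(double));
cudaMemset(d_in, 0, padded * sizeof(double));            // all-zero bytes == 0.0 for doubles
cudaMemcpy(d_in, h_data, total_size * sizeof(double), cudaMemcpyHostToDevice);

dim3 dimBlock(threads);
dim3 dimGrid(padded / (threads * 2));                    // reduce6 reads 2*blockSize elements per block per pass
reduce6<128><<<dimGrid, dimBlock, threads * sizeof(double)>>>(d_in, d_result, padded);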

Also be aware that the result of adding an array of floating-point values (fp16, fp32, fp64) depends on the order in which the numbers are added. That order may depend on the warp and block scheduling behavior of the GPU, and may not even be consistent across multiple runs of the same code.

So don’t expect a numerically identical result to a “Gold” solution on the CPU that simply adds up all the numbers sequentially.
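A tiny host-side demonstration of the order dependence (the numbers are arbitrary, just chosen to make the rounding visible):

#include <cstdio>

// Floating-point addition is not associative: the rounding depends on the
// order of operations, so a tree-shaped GPU reduction need not match a
// sequential CPU sum bit-for-bit.
int main()
{
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%.17g\n", (a + b) + c);   // prints 1: the large terms cancel first
    printf("%.17g\n", a + (b + c));   // prints 0: c is lost when added to b
    return 0;
}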