sum 10 of 1s, got 9.9999; sum 100000 of 1s, got 1.8535×10^193

Hi folks,

This is weird. I have the following kernel, which sums over all entries of a vector using reduction. I tested it a few days ago and it looked fine. However, today, when I tested it again by summing a vector of 10000 ones, I got 10016! Here is my kernel code:

__global__ void myreduce
(       double *g_odata,
        double *g_idata,
        unsigned int n
)
{
        __shared__ volatile double sdata[blockSize];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) + tid;
        unsigned int gridSize = blockSize*2*gridDim.x;
        sdata[tid] = 0;
        while (i < n)
        {
                sdata[tid] += g_idata[i] + g_idata[i+blockSize];
                i += gridSize;
        }
        __syncthreads();
        if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
        if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64]; }  __syncthreads(); }
        if (tid < 32)
        {
                if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
                if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
                if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
                if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
                if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
                if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
        }
        //if (tid == 0) g_odata[blockIdx.x] = sdata[0];
        if (tid == 0) atomicAdd(&g_odata[0], sdata[0]);
}

What is more, when I sum 100000 ones, I get 1.8535×10^193! In general, when summing over small vectors (up to about 10000 ones), I always get fractional results. For example:

sum 10 of 1s, I got 9.9999
sum 100 of 1s, I got 100.0006
sum 1000 of 1s, I got 999.9998
sum 10000 of 1s, I got 10016

When the size of the vector goes up, I get:

sum 100000 of 1s, I got 1.8535×10^193

But what is even weirder, when the size keeps going up, I get it right!

sum 1000000 of 1s, I got 1000000
sum 10000000 of 1s, I got 10000000
sum 100000000 of 1s, I got 100000000

Could anyone please tell me what is happening here? The same code worked fine a few days ago… then went crazy today…

Does this have anything to do with the GPU card? We share the same GPU card on a remote Ubuntu machine…

What is the atomicAdd there for? If you haven't been very careful to make sure that the contents of g_odata are zeroed before every call, there is your problem.

I zero g_odata every time this kernel is called, in my MATLAB code (which I did not show here). Otherwise atomicAdd() would accumulate results across calls, which did not happen to me.

I don't think it was a zeroing problem.

AFAICS this reduction kernel implicitly assumes that you reduce a vector whose length is a power of two. To see for yourself, just walk through the algorithm in your mind with a vector of length 3; for the thread with id 3 it gives you "sdata[3] += sdata[4]", which leads to an out-of-bounds access.
If you need it to work for an arbitrary vector length, you have to amend it. One simple way to get this is to add a test "&& (tid+xx) < blockSize" to each of the individual reduction lines (adapting xx to the current value).

Edit: Well, actually, it's a bit more complicated than this: what is implicit here is that blockSize is a power of two, and n is a multiple of 2*blockSize (not so sure of this). I guess you can ensure blockSize is a power of two on the calling side, so let's assume this is met. And let's also assume that blockSize == blockDim.x.
But you still have a hazard here: "sdata[tid] += g_idata[i] + g_idata[i+blockSize]" can lead to an out-of-bounds access when i is indeed less than n (as you test it) but i+blockSize is greater than or equal to n.

All in all, this kernel, albeit extremely effective (I guess), needs to be straightened out for corner cases. My initial correction was an error (I guess), but the problem I pointed out does exist.

First of all, you have declared g_odata as double*, but atomicAdd on double is only supported on devices of compute capability 6.0 and higher (older devices support it only for float and integer types). I am surprised that the code actually compiles if you are not building for such an architecture.
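For reference, on devices without native double atomicAdd, the usual workaround is the CAS loop given in the CUDA C Programming Guide; a sketch of that device function (assuming you then call it in place of atomicAdd in the kernel):

```cuda
// Emulated double-precision atomicAdd for devices of compute capability < 6.0,
// per the CUDA C Programming Guide.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Swap in (val + current value) only if *address is still unchanged.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread updated *address meanwhile
    return __longlong_as_double(old);
}
```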

I think Gilles_C is correct about the problem with out-of-bounds access. One easy solution would be

unsigned int i = blockIdx.x*blockSize + tid;
unsigned int gridSize = blockSize*gridDim.x;
sdata[tid] = 0;
while (i < n)
{
  sdata[tid] += g_idata[i];
  i += gridSize;
}

which should prevent this error.