sum 10 of 1s, got 9.9999; sum 100000 of 1s, got 1.8535×10^193

Hi folks,

This is weird. I have the following kernel, which sums over all entries of a vector using reduction. I tested it a few days ago and it looked fine. However, today, when I tested it again by summing a vector of 10000 ones, I got 10016! Here is my kernel code:

__global__ void myreduce
(       double *g_odata,
        double *g_idata,
        unsigned int n
)
{
        __shared__ volatile double sdata[blockSize];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) + tid;
        unsigned int gridSize = blockSize*2*gridDim.x;
        sdata[tid] = 0;
        while (i < n)
        {
                sdata[tid] += g_idata[i] + g_idata[i+blockSize];
                i += gridSize;
        }
        __syncthreads();
        if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
        if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64]; }  __syncthreads(); }
        if (tid < 32)
        {
                if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
                if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
                if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
                if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
                if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
                if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
        }
        //if (tid == 0) g_odata[blockIdx.x] = sdata[0];
        if (tid == 0) atomicAdd(&g_odata[0], sdata[0]);
}

What is more, when I sum 100000 ones, I get 1.8535×10^193! In general, when summing over small vectors (up to about 10000 ones), I always get fractional results. For example:

sum 10 of 1s, I got 9.9999
sum 100 of 1s, I got 100.0006
sum 1000 of 1s, I got 999.9998
sum 10000 of 1s, I got 10016

When the size of the vector goes up, I get:

sum 100000 of 1s, I got 1.8535×10^193

But what is even weirder, when the size keeps going up, I get it right!

sum 1000000 of 1s, I got 1000000
sum 10000000 of 1s, I got 10000000
sum 100000000 of 1s, I got 100000000

Could anyone please tell me what is happening here? The same code worked fine a few days ago… then went crazy today…

Does this have anything to do with the GPU card? We share the same GPU card on a remote Ubuntu machine…

What is the atomicAdd there for? If you haven't been very careful to make sure that the contents of g_odata are zeroed before every call, there is your problem.

I zero g_odata every time this kernel is called, in my MATLAB code (which I did not show here). Otherwise atomicAdd() would accumulate results across calls, which did not happen to me.

I don't think it was a zeroing problem.

AFAICS this reduction kernel implicitly assumes that you reduce a vector whose length is a power of two. To see for yourself, just walk through the algorithm in your mind with a vector of length 3; for the thread with id 3 it gives you "sdata[3] += sdata[4]", which leads to an out-of-bounds access.
If you need it to work for an arbitrary vector length, you have to amend it. One simple way to get this is to add a test "&& (tid+xx) < blockSize" to each of the individual reduction lines (adapting xx to the current value).

Edit: Well, actually, it's a bit more complicated than this: what is implicit here is that blockSize is a power of two, and n is a multiple of 2*blockSize (not so sure of this). I guess you can ensure blockSize is a power of two on the calling side, so let's assume this is met. And let's also assume that blockSize == blockDim.x.
But you still have a hazard here: "sdata[tid] += g_idata[i] + g_idata[i+blockSize]" can lead to an out-of-bounds access when i is indeed less than n (as you test it) but i+blockSize is greater than or equal to n.

All in all, this kernel, albeit extremely effective (I guess), needs to be straightened out for corner cases. My initial correction was an error (I guess), but the problem I pointed out does exist.

First of all, you have declared g_odata as double*, but atomicAdd on double is only supported on devices of compute capability 6.0 and higher (older devices support it only for float and integer types). I am surprised that the code actually compiles if you are not building for such an architecture.
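For reference, on devices without native double atomicAdd, the usual workaround is the CAS loop given in the CUDA C Programming Guide; a sketch of that device function (assuming you then call it in place of atomicAdd in the kernel):

```cuda
// Emulated double-precision atomicAdd for devices of compute capability < 6.0,
// per the CUDA C Programming Guide.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Swap in (val + current value) only if *address is still unchanged.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread updated *address meanwhile
    return __longlong_as_double(old);
}
```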

I think Gilles_C is correct about the problem with out-of-bounds access. One easy solution would be

unsigned int i = blockIdx.x*blockSize + tid;
unsigned int gridSize = blockSize*gridDim.x;
sdata[tid] = 0;
while (i < n)
{
  sdata[tid] += g_idata[i];
  i += gridSize;
}

which should prevent this error.