Reduction Reduction Reduction................. Precision Confusion Race Condition...... HELP!

I tried my hand at reduction today and burnt all my fingers… :-(

This is just a plain addition kernel. It works fine if the “floating” point numbers have zero fractional part (like integers).

However, the moment I use fractional parts (randomly generated floating-point numbers), I get incorrect results.

I am aware of the 80-bit extended-precision thing in x87 and I disable it explicitly in the gcc options… Here is my makefile:

Even double precision falters at times… but not as badly as singles. And, this is on a TESLA C1060.

Here is my reduction code. Each block reduces to one value… I need to call this in a loop until the number of blocks is 1. But the code fails even for 1 block and 9 elements (4 also fails). It works well if I use thread 0 to sum all elements sequentially, but then there is no parallelism… I mainly do this to make sure that there are NO precision-related issues, i.e., apples-to-apples comparisons only.


```
27 __global__ void REDUCE_ADD(REAL *data, int n, REAL *result)
28 {
29     extern __shared__ volatile REAL cache[];
30     __shared__ volatile int base;
31     int totalData;
32     int fold;
33     REAL temp;

35     if (threadIdx.x == 0)
36     {
37         base = (blockIdx.x * (blockDim.x << 1));
38     }
39     __syncthreads();

41     if (base >= n)
42     {
43         return;
44     }

46     if ((base + threadIdx.x) < n)
47     {
48         cache[threadIdx.x] = data[base + threadIdx.x];
49         //printf("GPU LOAD Thread %d: %f ", threadIdx.x, cache[threadIdx.x]);
50     }

52     if ((base + blockDim.x + threadIdx.x) < n)
53     {
54         cache[threadIdx.x] = cache[threadIdx.x] + data[base + blockDim.x + threadIdx.x];
55     }

57     __syncthreads();

59     totalData = (n - base) > (blockDim.x) ? (blockDim.x) : (n - base);
60     fold = (totalData >> 1);
61     while (fold)
62     {
63         if (threadIdx.x < fold)
64         {
65             //printf("Adding %f, %f\n", cache[threadIdx.x], cache[threadIdx.x + fold]);
66             cache[threadIdx.x] = cache[threadIdx.x] + cache[threadIdx.x + fold];
67             if (threadIdx.x == 0)
68             {
69                 if ((totalData & 1) == 1)
71                     cache[threadIdx.x] = cache[threadIdx.x] + cache[totalData - 1];
72                 }
73             }
74         }

76         totalData = fold;
77         fold = fold >> 1;
78         //if (threadIdx.x == 0)
79         //    printf("\n\n");
80         __syncthreads();

82     }
83     __syncthreads();

85     if (threadIdx.x == 0)
86     {
87         result[blockIdx.x] = cache[0];
88     }
89     return;
90 }
```


I compare the CPU result and GPU result using an IF statement (NO human comparison) before throwing an error…

And it fails randomly… frequently… though sometimes it works fine… (for integer-like floats, it works fine all the time)

Can someone help me here please?


Ubuntu Linux 9.04, CUDA 2.2, TESLA C1060

Best Regards,


OK, so I took your code and tried to compile it.

  1. I had to remove the line numbers :P
  2. The closing bracket on line 72 should be removed.
  3. I replaced the extern with a static shared array of size 1024. I just don't like externs ;)
  4. I get completely incorrect results, needs investigation…
  5. I assume you are aware this kernel works only if n < gridDim.x * blockDim.x * 2?
  6. Testing on gridSize=60, blockSize=512. N=60*512 gives the correct result. N=9 crashes completely.
  7. You might want to add "result[blockIdx.x] = 0.0f;" after line 42 to ensure no garbage element is added when you sum up the final result array.
  8. I get correct results for n=9 or n=4 (up to a few ulps), using floats. The crash reported in 6 was due to an error in my host code.

I think you will always get some precision issues. To get rid of that problem you would have to sum up the elements in exactly the same order, which boils down to a sequential algorithm.


THANKS a LOT for testing… Very sorry about the extra work…

And as I said before, the kernel needs to be called repeatedly and the result pointer needs to be an array… To get rid of all this hassle, let me give you the full code… PFA…


  1. Currently the code uses "integer-like" floats… Change the casting in the "randFloat" function to test other cases.
    a) This code (with integer-like floats) works PERFECTLY if "REAL" is declared as "double". If REAL is declared as "float", then errors occur. This is slightly different from the version in my earlier post… but this one is the correct one.

  2. I suggest checking with small numbers. Just change the "DATA_SIZE" define and you are good to go. Even a DATA_SIZE of 9 or 4 with fully random floats will cause deviation. This deviation occurs more easily in single precision than in double precision.
    a) Again, with fully random numbers, if I use thread 0 alone to sum up, it all works fine. So I believe this is not a hardware capability issue. Maybe when multiple threads execute there is some truncation going on???

  3. During debug sessions, when I print values, I see the GPU rounds things off easily… like 1.75 to 2.00 and so on, but this was seen in device emulation as well… Hmm… What does NVCC use during device emulation? x87 or SSE (xmm)?

    Makefile.txt (109 Bytes)

On the theoretical part: let us assume we have some real type which has 5 significant digits, and the addition operation rounds to the nearest representable value. We want to sum up the following 4 values:

f1=1.0004 e00

f2=1.0001 e01

f3=1.0004 e00

f4=1.0001 e01

If we add them up sequentially we get:

s1 := f1 + f2 = 1.1001 e01

s2 := s1 + f3 = 1.2001 e01

s3 := s2 + f4 = 2.2002 e01

If we add them by folding we get:

p1 := f1 + f3 = 2.0008 e00

p2 := f2 + f4 = 2.0002 e01

p3 := p1 + p2 = 2.2003 e01 (rounded up)

And finally p3 != s3, although we used exactly the same floating-point model.

In a binary form it might be slightly harder to show, but I am sure similar example can be given there.

Unless you do folding on CPU side as well?

I understand minor differences… but I am seeing big differences… Let me dig further and come back on this.

Also, integer-like floats cause differences with "float" but not with "double"…

Ah, I got the impression you need exactly the same results, since you talked about disabling 80-bit floats in gcc.
Check my first response, as I edited it several times :P
I get only small error differences now…

------------DATA SIZE 15360 --------------

REDUCE_ADD Kernel: Grid 240, Block 32
REDUCE_ADD Kernel: Grid 4, Block 32
REDUCE_ADD Kernel: Grid 1, Block 32
The GPU calculation for finding SUM took 0.000234 seconds
GPU SUM Result = 23000458.000000

The GPU calculation for finding MAXIMUM took 0.000065 seconds
GPU MAX Result = 1999.000000

CPU calculations…
The CPU took 0.000063 seconds to calculate SUM
CPU SUM Result = 23000562.000000

The CPU took 0.000065 seconds to calculate MAX
CPU MAX Result = 1999.000000
********** BAD GPUSum *****************

All random numbers are between 1000 and 2000.

Do you think this is a precision error? There is no fractional part in the "input floats". If I make REAL a double, all these problems vanish…

Of course it is a precision error. Floats have only 6 to 7 significant digits in decimal representation :)



A precision error??? The difference is 104… and we are adding only 15360 elements…

The error is in the 6th-from-last bit. I believe that is an error which summed itself up from 1-ulp errors.
(Note: the binary representations above are IEEE 754, but I believe the difference is not that big, if there is any at all.)

Check this applet:

Welcome to the world of single precision arithmetic. Have a nice day.

If you want greater accuracy in single precision, then google “Kahan summation”.

Thank you, Cyg… I will look into it.

Is the Kahan summation parallelizable?



Sure, you can use it on partial sums just as easily as on a large serial sum. Each summing thread just maintains its own local truncation carry-forward value, which is discarded when the partial summation is completed, at which point the partial sum should be correctly rounded. Whatever sums the partial sums (a second reduction pass, or a host thread) can also use Kahan summation to arrive at a final value. It should be considerably more accurate than doing nothing (although about twice as slow).

You’ve answered my next question :) is it slower? :)

Why twice as slow?? Did you try/see any CUDA implementations of this?

How would this compare (run time/accuracy/…) to running "regular" summation using doubles?



A parallel reduction might actually introduce fewer errors than the naive sequential algorithm: the error would be of order log(n) rather than n.
My suggestion: try applying Kahan summation in your CPU code and then compare CPU/GPU.

In Kahan summation, each accumulation requires 4 FLOPs rather than 1, and about twice as many loads and stores as a naïve accumulator. How that translates into relative performance is rather architecture-dependent. But it is slower.

I am pretty sure a couple of SDK entries that compute sums use it, but I might be wrong about that. There was an old Dr. Dobb's article I remember reading about 15 years ago which covered more or less everything you ever wanted to know about summation but were afraid to ask. I kept a coffee-stained photocopy in a prominent place for many years. You might try to track it down.

EDIT: and through the magic of the internet, here it is online!