Float associativity debugging error

Is the following code affected by the fact that float addition is not associative?

// Use the first warp of the block to compute a parallel reduction on the
// partial sums in shared memory.
if (threadIdx.x < 32) {
#pragma unroll
    for (int i = 32; i < TPB; i += 32) buff[threadIdx.x] += buff[threadIdx.x + i];
    if (threadIdx.x < 16) { buff[threadIdx.x] += buff[threadIdx.x + 16]; }
    if (threadIdx.x < 8)  { buff[threadIdx.x] += buff[threadIdx.x + 8]; }
    if (threadIdx.x < 4)  { buff[threadIdx.x] += buff[threadIdx.x + 4]; }
    if (threadIdx.x < 2)  { buff[threadIdx.x] += buff[threadIdx.x + 2]; }

    // Finalise and write out the result to global memory
    if (threadIdx.x == 0) {
        r[blockIdx.x] = b[blockIdx.x] - buff[0] - buff[1];
    }
}


The results I am getting are very close to the result I compute on the CPU, but not quite exact. I am just checking whether it is the algorithm or a float associativity problem.

You're probably looking at a race condition (though a static one)… kind of a sticky one at that.

You might need to break these up a bit to get an answer equivalent to the CPU.
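One common way to break it up (a sketch, not your exact kernel): for this style of warp-level reduction, the usual fix on this generation of hardware is to access the shared array through a `volatile` pointer, so every read and write actually goes to shared memory instead of a stale value being reused from a register. Assuming `buff` and `TPB` as in your kernel:

```cuda
if (threadIdx.x < 32) {
    // A volatile pointer forces each += to re-read shared memory
    // rather than reuse a value cached in a register.
    volatile float *vbuff = buff;
    for (int i = 32; i < TPB; i += 32)
        vbuff[threadIdx.x] += vbuff[threadIdx.x + i];
    if (threadIdx.x < 16) vbuff[threadIdx.x] += vbuff[threadIdx.x + 16];
    if (threadIdx.x < 8)  vbuff[threadIdx.x] += vbuff[threadIdx.x + 8];
    if (threadIdx.x < 4)  vbuff[threadIdx.x] += vbuff[threadIdx.x + 4];
    if (threadIdx.x < 2)  vbuff[threadIdx.x] += vbuff[threadIdx.x + 2];
}
```

This fixes the read-after-write hazard but does not make the float result bitwise equal to a serial CPU sum; the summation order is still different.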

Floating point arithmetic is a lot like the old adage attributed to Einstein: "A man with one watch always knows exactly what time it is, but a man with two watches is never quite sure."

Floating point results from anything other than very short calculations, when made at equivalent precision on different architectures, will never match. The only thing you can do is compare the magnitude of the relative and absolute errors (and preferably their distribution) and satisfy yourself that they are within reason and that there are no unexplained results. If you are looking at single precision results between the GPU and CPU, be aware of the following:

    Single precision on the GPU isn’t IEEE-754 compliant. There is a MAD operation which the compiler likes to use which doesn’t follow the fused multiply add rounding rules. There are math library functions in CUDA you can use to force the compiler to “do the right thing”, at the expense of some performance.

    Single precision on the host CPU often isn’t single precision at all - sometimes it is done in double precision and rounded afterwards, sometimes it is done in 80 bit internally and rounded (this is the old 387 FPU instruction set which still gets used by some compilers/libraries on IA32 systems).

    As you have noted, there is no associativity in floating point, so the simple act of parallelizing a calculation can change its result (so can compiler optimizations, using SIMD instructions and all sorts of other things).
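As an illustration of the first point, the CUDA intrinsics with explicit rounding modes stop the compiler from contracting a multiply and an add into MAD (a sketch; `strict_madd` is a hypothetical name):

```cuda
// Sketch: force separately rounded IEEE-754 single-precision
// operations instead of letting the compiler contract a*b+c
// into a MAD/FMAD with non-IEEE intermediate rounding.
__device__ float strict_madd(float a, float b, float c)
{
    // May be contracted into a single MAD by the compiler:
    //   return a * b + c;

    // Explicit round-to-nearest intrinsics are never contracted:
    return __fadd_rn(__fmul_rn(a, b), c);
}
```

The cost is that the hardware can no longer fuse the two operations, which is the performance penalty mentioned above.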

With all of that, expect there to be differences. Be happy when you can explain them…

And of course Fermi joins the party as well ;)



So I am posting some results here just to get some feedback from you guys about my sparse matrix multiplies. When I compare the results achieved by the CPU against those achieved by the device, does the difference between them indicate that the algorithm isn't working correctly, or is the discrepancy due to the CUDA architecture?

Matrix size 6000000x6000000, multiplied by a vector; the resulting vector is then fed back in as the input vector, and this is repeated 5 times.

                    Iteration 1    Iteration 2    Iteration 3    Iteration 4    Iteration 5
Device[0]           0.002691466    0.003525914    0.002875949    0.002786225    0.002773622
Device[1]           0.002412131    0.002587259    0.001578028    0.001428416    0.001385353
Device[2]           0.001455923    0.002155874    0.001776277    0.001811765    0.001805371
Device[3]           0.001399427    0.002343167    0.001882699    0.001840642    0.001828669
Device[4]           0.001861754    0.001208553    0.000525089    0.000460119    0.000438813
Device[5]           0.000735813    0.001469866    0.001135976    0.001142191    0.001136943

Host[0]             0.002691867    0.003461844    0.002743278    0.002578126    0.002488196
Host[1]             0.002412045    0.002494849    0.001430315    0.001224104    0.001129431
Host[2]             0.001456124    0.002102109    0.001672475    0.001653573    0.001593
Host[3]             0.001399498    0.002299091    0.001792996    0.001700758    0.001637875
Host[4]             0.001861915    0.001193579    0.00049543     0.000415142    0.000379034
Host[5]             0.000735834    0.001446452    0.001087024    0.001063812    0.001028134

Host[0]-Device[0]   4.005E-07      -6.40697E-05   -0.000132671   -0.0002081     -0.000285426
Host[1]-Device[1]   -8.59E-08      -9.24102E-05   -0.000147713   -0.000204312   -0.000255922
Host[2]-Device[2]   2.018E-07      -5.37641E-05   -0.000103802   -0.000158192   -0.000212371
Host[3]-Device[3]   7.15E-08       -4.40755E-05   -8.97027E-05   -0.000139885   -0.000190794
Host[4]-Device[4]   1.612E-07      -1.49738E-05   -2.96595E-05   -4.4977E-05    -5.97787E-05
Host[5]-Device[5]   2.136E-08      -2.34143E-05   -4.89518E-05   -7.83781E-05   -0.000108809

Without more information, it is impossible to say. Why not generate a matrix with known properties (like a tridiagonal matrix or the unit matrix), compute the product with a random vector, and analyse the relative and absolute deviation from the expected solution? That would give at least some reliable indication of whether your multiplication code is working as expected or not.
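For the deviation analysis itself, a host-side helper along these lines could be used (a sketch; `max_errors` and its arguments are hypothetical names):

```c
#include <math.h>

/* Maximum absolute and relative deviation between a reference
 * vector (e.g. the CPU result) and a result vector copied back
 * from the device. */
void max_errors(const float *ref, const float *res, int n,
                float *max_abs, float *max_rel)
{
    *max_abs = 0.0f;
    *max_rel = 0.0f;
    for (int i = 0; i < n; i++) {
        float abs_err = fabsf(ref[i] - res[i]);
        if (abs_err > *max_abs) *max_abs = abs_err;
        if (ref[i] != 0.0f) {
            float rel_err = abs_err / fabsf(ref[i]);
            if (rel_err > *max_rel) *max_rel = rel_err;
        }
    }
}
```

Looking at both numbers (and how they grow per iteration) tells you whether you are seeing benign rounding noise or an actual algorithmic error.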

The thing is, since the matrix is so big, I have different conditions depending on the column length, therefore I need to make some sort of test using a matrix in lower triangular form. I was thinking of doing this sort of multiplication; each time the expected result should be 1, or a value very close to 1. What do you think?

1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

0.1667 0.2000 0.2500 0.3333 0.5000 1.0000
0.1667 0.2000 0.2500 0.3333 0.5000 0.0000
0.1667 0.2000 0.2500 0.3333 0.0000 0.0000
0.1667 0.2000 0.2500 0.0000 0.0000 0.0000
0.1667 0.2000 0.0000 0.0000 0.0000 0.0000
0.1667 0.0000 0.0000 0.0000 0.0000 0.0000

So I implemented the above lower triangular data source and did a matrix multiplication of a 400000x400000 matrix with a vector containing all ones; the results turned out to be all correct, with an error of 0.000002.
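For reference, a small host-side sketch of that check (my reading of the matrix above: column j holds n-j nonzero entries, each equal to 1/(n-j), so multiplying a vector of ones from the left gives a column sum of 1 everywhere; the function names are hypothetical):

```c
#include <stdlib.h>

/* Build the dense lower-triangular test matrix described above:
 * column j (0-indexed) has n-j nonzero entries of 1/(n-j) in its
 * top n-j rows, so every column sums to (almost exactly) 1. */
float *build_test_matrix(int n)
{
    float *m = calloc((size_t)n * n, sizeof *m);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n - j; i++)
            m[i * n + j] = 1.0f / (float)(n - j);
    return m;
}

/* y = ones^T * M: y[j] is the sum of column j, expected ~1. */
void ones_times_matrix(const float *m, float *y, int n)
{
    for (int j = 0; j < n; j++) {
        y[j] = 0.0f;
        for (int i = 0; i < n; i++)
            y[j] += m[i * n + j];
    }
}
```

The small residual error (your 0.000002) comes from 1/(n-j) not being exactly representable in single precision.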

Now the question at hand: I know the sparse matrix multiplier is working correctly, and all my other PageRank modules work correctly, yet due to this float discrepancy the number of iterations for convergence on the CPU is 54, while the number of iterations required for convergence on the device is 132. Does anyone have any idea why the device implementation requires so many more iterations than the CPU version?


                 CPU            Device
1                0.002252856    0.00302125
2                0.001188674    0.001681346
3                0.001400959    0.00193062
4                0.00145611     0.001967428
5                0.000532672    0.000671726
6                0.00090303     0.001213931

No. iterations   54             132

I think I have a bug somewhere in my code. I found this atomic float addition on the forums and I am making quite heavy use of it. Is this function safe, and does it guarantee that the atomic float addition is the desired one? Logically, in my mind, it makes sense, but I am not quite sure.

__device__ inline void atomicAdd(float* address, float value)
{
    float old = value;
    while ((old = atomicExch(address, atomicExch(address, 0.0f) + old)) != 0.0f)
        ;
}


It’s safe, but not very efficient, as it needs two atomic operations even in the optimal case.

You can try this atomic operation instead:

__device__ float atomicAdd(float *address, float value)
{
    int oldval, newval, readback;

    oldval = __float_as_int(*address);
    newval = __float_as_int(__int_as_float(oldval) + value);
    while ((readback = atomicCAS((int *)address, oldval, newval)) != oldval) {
        oldval = readback;
        newval = __float_as_int(__int_as_float(oldval) + value);
    }
    return __int_as_float(oldval);
}


Thanks, dude. I can confirm that both of them produce the correct result. I have narrowed the problem down in a new thread I posted:

Different Results with different memory types