Float associativity debugging error

Is the following code affected by the fact that float addition is not associative?

// Use the first warp of the block to compute a parallel reduction on the
// partial sums in shared memory.
if (threadIdx.x < 32) {
#pragma unroll
    for (int i = 32; i < TPB; i += 32) buff[threadIdx.x] += buff[threadIdx.x + i];
    if (threadIdx.x < 16) { buff[threadIdx.x] += buff[threadIdx.x + 16]; }
    if (threadIdx.x < 8)  { buff[threadIdx.x] += buff[threadIdx.x + 8]; }
    if (threadIdx.x < 4)  { buff[threadIdx.x] += buff[threadIdx.x + 4]; }
    if (threadIdx.x < 2)  { buff[threadIdx.x] += buff[threadIdx.x + 2]; }

    // Finalise and write out the result to global memory
    if (threadIdx.x == 0) {
        r[blockIdx.x] = b[blockIdx.x] - buff[0] - buff[1];
    }
}


The results I am getting are very close to the result I compute on the CPU, but not quite exact. I am just checking whether it is the algorithm or a float associativity problem.

You're probably looking at a race condition (though a static one)… kind of a sticky one at that.

You might need to break these up a bit to get an answer equivalent to the CPU.
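One common way to break it up (a sketch, not your exact kernel): for this style of warp-level reduction, the usual fix on this generation of hardware is to access the shared array through a `volatile` pointer, so every read and write actually goes to shared memory instead of a stale value being reused from a register. Assuming `buff` and `TPB` as in your kernel:

```cuda
if (threadIdx.x < 32) {
    // A volatile pointer forces each += to re-read shared memory
    // rather than reuse a value cached in a register.
    volatile float *vbuff = buff;
    for (int i = 32; i < TPB; i += 32)
        vbuff[threadIdx.x] += vbuff[threadIdx.x + i];
    if (threadIdx.x < 16) vbuff[threadIdx.x] += vbuff[threadIdx.x + 16];
    if (threadIdx.x < 8)  vbuff[threadIdx.x] += vbuff[threadIdx.x + 8];
    if (threadIdx.x < 4)  vbuff[threadIdx.x] += vbuff[threadIdx.x + 4];
    if (threadIdx.x < 2)  vbuff[threadIdx.x] += vbuff[threadIdx.x + 2];
}
```

This fixes the read-after-write hazard but does not make the float result bitwise equal to a serial CPU sum; the summation order is still different.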

Floating point arithmetic is a lot like the old adage attributed to Einstein: "A man with one watch always knows exactly what time it is, but a man with two watches is never quite sure."

Floating point results from anything other than very short calculations, when made at equivalent precision on different architectures, will never match. The only thing you can do is compare the magnitude of the relative and absolute errors (and preferably their distribution) and satisfy yourself that they are within reason and that there are no unexplained results. If you are looking at single precision results between the GPU and CPU, be aware of the following:

    Single precision on the GPU isn’t IEEE-754 compliant. There is a MAD operation which the compiler likes to use which doesn’t follow the fused multiply add rounding rules. There are math library functions in CUDA you can use to force the compiler to “do the right thing”, at the expense of some performance.

    Single precision on the host CPU often isn’t single precision at all - sometimes it is done in double precision and rounded afterwards, sometimes it is done in 80 bit internally and rounded (this is the old 387 FPU instruction set which still gets used by some compilers/libraries on IA32 systems).

    As you have noted, there is no associativity in floating point, so the simple act of parallelizing a calculation can change its result (so can compiler optimizations, using SIMD instructions and all sorts of other things).
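As an illustration of the first point, the CUDA intrinsics with explicit rounding modes stop the compiler from contracting a multiply and an add into MAD (a sketch; `strict_madd` is a hypothetical name):

```cuda
// Sketch: force separately rounded IEEE-754 single-precision
// operations instead of letting the compiler contract a*b+c
// into a MAD/FMAD with non-IEEE intermediate rounding.
__device__ float strict_madd(float a, float b, float c)
{
    // May be contracted into a single MAD by the compiler:
    //   return a * b + c;

    // Explicit round-to-nearest intrinsics are never contracted:
    return __fadd_rn(__fmul_rn(a, b), c);
}
```

The cost is that the hardware can no longer fuse the two operations, which is the performance penalty mentioned above.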

With all of that, expect there to be differences. Be happy when you can explain them…

And of course Fermi joins the party as well ;)



So I am posting some results here just to get some feedback from you guys about my sparse matrix multiplies. When I compare the results achieved by the CPU against those achieved by the device, does the difference between them indicate that the algorithm isn't working correctly, or is the discrepancy due to the CUDA architecture?

Matrix size 6000000x6000000, multiplied by a vector; the resulting vector is then fed back in as the input vector, and this is repeated 5 times.

                    Iteration 1    Iteration 2    Iteration 3    Iteration 4    Iteration 5
Device[0]           0.002691466    0.003525914    0.002875949    0.002786225    0.002773622
Device[1]           0.002412131    0.002587259    0.001578028    0.001428416    0.001385353
Device[2]           0.001455923    0.002155874    0.001776277    0.001811765    0.001805371
Device[3]           0.001399427    0.002343167    0.001882699    0.001840642    0.001828669
Device[4]           0.001861754    0.001208553    0.000525089    0.000460119    0.000438813
Device[5]           0.000735813    0.001469866    0.001135976    0.001142191    0.001136943

Host[0]             0.002691867    0.003461844    0.002743278    0.002578126    0.002488196
Host[1]             0.002412045    0.002494849    0.001430315    0.001224104    0.001129431
Host[2]             0.001456124    0.002102109    0.001672475    0.001653573    0.001593
Host[3]             0.001399498    0.002299091    0.001792996    0.001700758    0.001637875
Host[4]             0.001861915    0.001193579    0.00049543     0.000415142    0.000379034
Host[5]             0.000735834    0.001446452    0.001087024    0.001063812    0.001028134

Host[0]-Device[0]   4.005E-07      -6.40697E-05   -0.000132671   -0.0002081     -0.000285426
Host[1]-Device[1]   -8.59E-08      -9.24102E-05   -0.000147713   -0.000204312   -0.000255922
Host[2]-Device[2]   2.018E-07      -5.37641E-05   -0.000103802   -0.000158192   -0.000212371
Host[3]-Device[3]   7.15E-08       -4.40755E-05   -8.97027E-05   -0.000139885   -0.000190794
Host[4]-Device[4]   1.612E-07      -1.49738E-05   -2.96595E-05   -4.4977E-05    -5.97787E-05
Host[5]-Device[5]   2.136E-08      -2.34143E-05   -4.89518E-05   -7.83781E-05   -0.000108809

Without more information, it is impossible to say. Why not generate a matrix with known properties (like a tridiagonal matrix or the unit matrix), compute the product with a random vector, and analyse the relative and absolute deviation from the expected solution? That would give at least some reliable indication of whether your multiplication code is working as expected or not.
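For the deviation analysis itself, a host-side helper along these lines could be used (a sketch; `max_errors` and its arguments are hypothetical names):

```c
#include <math.h>

/* Maximum absolute and relative deviation between a reference
 * vector (e.g. the CPU result) and a result vector copied back
 * from the device. */
void max_errors(const float *ref, const float *res, int n,
                float *max_abs, float *max_rel)
{
    *max_abs = 0.0f;
    *max_rel = 0.0f;
    for (int i = 0; i < n; i++) {
        float abs_err = fabsf(ref[i] - res[i]);
        if (abs_err > *max_abs) *max_abs = abs_err;
        if (ref[i] != 0.0f) {
            float rel_err = abs_err / fabsf(ref[i]);
            if (rel_err > *max_rel) *max_rel = rel_err;
        }
    }
}
```

Looking at both numbers (and how they grow per iteration) tells you whether you are seeing benign rounding noise or an actual algorithmic error.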

The thing is, since the matrix is so big, I have different conditions depending on the column length, therefore I need to make some sort of test using a matrix in lower triangular form. I was thinking of doing this sort of multiplication; each time the expected result should be 1, or a value very close to 1. What do you think?

1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

0.1667 0.2000 0.2500 0.3333 0.5000 1.0000
0.1667 0.2000 0.2500 0.3333 0.5000 0.0000
0.1667 0.2000 0.2500 0.3333 0.0000 0.0000
0.1667 0.2000 0.2500 0.0000 0.0000 0.0000
0.1667 0.2000 0.0000 0.0000 0.0000 0.0000
0.1667 0.0000 0.0000 0.0000 0.0000 0.0000

So I implemented the above lower triangular data source and did a matrix multiplication of a 400000x400000 matrix with a vector containing all ones; the results turned out to be all correct, with an error of 0.000002.
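For reference, a small host-side sketch of that check (my reading of the matrix above: column j holds n-j nonzero entries, each equal to 1/(n-j), so multiplying a vector of ones from the left gives a column sum of 1 everywhere; the function names are hypothetical):

```c
#include <stdlib.h>

/* Build the dense lower-triangular test matrix described above:
 * column j (0-indexed) has n-j nonzero entries of 1/(n-j) in its
 * top n-j rows, so every column sums to (almost exactly) 1. */
float *build_test_matrix(int n)
{
    float *m = calloc((size_t)n * n, sizeof *m);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n - j; i++)
            m[i * n + j] = 1.0f / (float)(n - j);
    return m;
}

/* y = ones^T * M: y[j] is the sum of column j, expected ~1. */
void ones_times_matrix(const float *m, float *y, int n)
{
    for (int j = 0; j < n; j++) {
        y[j] = 0.0f;
        for (int i = 0; i < n; i++)
            y[j] += m[i * n + j];
    }
}
```

The small residual error (your 0.000002) comes from 1/(n-j) not being exactly representable in single precision.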

Now the question at hand: I know the sparse matrix multiplier is working correctly, and all my other PageRank modules work correctly, yet due to this float discrepancy the number of iterations for convergence on the CPU is 54, while the number of iterations required for convergence on the device is 132. Does anyone have any idea why the device implementation requires so many more iterations than the CPU version?


                 CPU            Device
1                0.002252856    0.00302125
2                0.001188674    0.001681346
3                0.001400959    0.00193062
4                0.00145611     0.001967428
5                0.000532672    0.000671726
6                0.00090303     0.001213931

No. iterations   54             132

I think I have a bug somewhere in my code. I found this atomic float addition on the forums and I am making quite heavy use of it. Is this function safe, and does it guarantee that the atomic float addition is the desired one? Logically, in my mind, it makes sense, but I am not quite sure.

__device__ inline void atomicAdd(float* address, float value)
{
    float old = value;
    while ((old = atomicExch(address, atomicExch(address, 0.0f) + old)) != 0.0f)
        ;
}


It’s safe, but not very efficient, as it needs two atomic operations even in the optimal case.

You can try this atomic operation instead:

__device__ float atomicAdd(float *address, float value)
{
    int oldval, newval, readback;

    oldval = __float_as_int(*address);
    newval = __float_as_int(__int_as_float(oldval) + value);
    while ((readback = atomicCAS((int *)address, oldval, newval)) != oldval) {
        oldval = readback;
        newval = __float_as_int(__int_as_float(oldval) + value);
    }
    return __int_as_float(oldval);
}


Thanks, dude. I can confirm that both of them produce the correct result. I have narrowed the problem down in a new thread I posted:

Different Results with different memory types