Is the following code effected by the fact that float addition is not associative?

// Use first warp of block to compute parallel reduction on the
// partial sum in shared memory.
if (threadIdx.x < 32) { #pragma unroll
for(int i=32; i<TPB; i+=32) buff[threadIdx.x] += buff[threadIdx.x+i];
}
if (threadIdx.x < 16) { buff[threadIdx.x] += buff[threadIdx.x+16]; }
if (threadIdx.x < 8) { buff[threadIdx.x] += buff[threadIdx.x+8]; }
if (threadIdx.x < 4) { buff[threadIdx.x] += buff[threadIdx.x+4]; }
if (threadIdx.x < 2) { buff[threadIdx.x] += buff[threadIdx.x+2]; }

// Finalise and write out the results to global memory
if (threadIdx.x == 0) {
r[blockIdx.x] = b[blockIdx.x] - buff[0] - buff[1];
}

}

since the results i m getting are very close to the result i compute on the cpu but not quite exact just checking if it is the algorithm or float associativity problem?

Floating point arithmetic is a lot like the old adage attributed to Einstein : “A man with one watch always know exactly what time it is, but a man with two watches is never quite sure”.

Floating point results from anything other than very short calculations, when made at equivalent precision on different architectures will never match. The only thing you can do is compare the magnitude of the relative and absolute errors (and preferably their distribution) and satisfy yourself that they are within reason and that there are no unexplained results. If you are looking at single precision results between the GPU and CPU, be aware of the following:

Single precision on the GPU isn’t IEEE-754 compliant. There is a MAD operation which the compiler likes to use which doesn’t follow the fused multiply add rounding rules. There are math library functions in CUDA you can use to force the compiler to “do the right thing”, at the expense of some performance.

Single precision on the host CPU often isn’t single precision at all - sometimes it is done in double precision and rounded afterwards, sometimes it is done in 80 bit internally and rounded (this is the old 387 FPU instruction set which still gets used by some compilers/libraries on IA32 systems).

As you have noted, there is no associativity in floating point, so the simple act of parallelizing a calculation can change its result (so can compiler optimizations, using SIMD instructions and all sorts of other things).

With all of that, expect there to be differences. Be happy when you can explain them…

So i m posting some results here just to get some feedback from you guys about my sparse matrix multiplies when i compare the results achieved by the CPU and does achieved by the device and the difference between them do the results indicate that the algorithm isn’t working correctly or is the discrepancy between the results due to the CUDA architecture?

Matrix size 6000000x6000000 x vector the resulting vector is then switched as the input vector and repeated 5 times

Without more information, it is impossible to say. Why not generate a matrix with known properties (like a tridiagonal matrix or unit matrix) and compute the product with a random vector and analyze the relative and absolute deviation from the expect solution? That will surely give at least some sort of reliable indication of whether your multiplication code is working as expected or not.

the thing is since the matrix is so big i have different conditions depending the the column length therefore i need to make some sort of test using a lower triangular form matrix. Was thinking of doing

this sort of multiplication each time the expected result should be 1 or very close value to 1 what do you think?

So i implemented the above lower triangular form data source and did matrix multiplication on 400000x400000 matrix with a vector containing all one the results turned out to be all correct with and error of 0.000002.

Now the question at hand is i know the sparse matrix multiplier is working correctly and all my other PageRank modules work correctly yet due to this float discrepancy the number of iteration for convergence on the CPU is 54 and the number of iteration required for convergence on the host is 132. Does anyone have any idea why the device implementation required much more iterations than the CPU version?

I think i have a bug somewhere in my code and i found this atomic float addition on the forums and i m making quite heavy use of it. is this function safe and grantees that the atomic float addition is desired one? since logically in my mind it makes sens but i m not quite sure

__device__ inline void atomicAdd(float* address, float value)
{
float old = value;
while ((old = atomicExch(address, atomicExch(address, 0.0f)+old))!=0.0f);
};