I have a general question about two CUDA implementations of the same algorithm, which sums the columns of a matrix. In the first implementation, I use one thread per block and as many blocks as there are columns. The kernel uses a for loop so that a single thread adds up an entire column. In the second implementation, I use two threads per block and the same number of blocks as columns. In the kernel's for loop, each thread adds half of a column. This returns a 2xN matrix, which is then passed through the same kernel to collapse it a second time. Both implementations give the correct answer, but the first consistently gives 0 error compared to MATLAB's sum() function, while the second gives errors on the order of 1e-5. Can anyone tell me why that is?

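IEEE floating-point addition is not associative: (a + b) + c is not guaranteed to equal a + (b + c), because every intermediate result is rounded. Summing the same numbers in a different order or grouping can therefore produce slightly different results. Your second implementation groups the additions differently (two half-column partial sums that are then combined), so its rounding differs from that of a single sequential pass over the column.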