It works correctly for a small number of threads (<=16) and sometimes fails for larger counts. It seems I have misunderstood something. Is it true that __syncthreads() synchronizes only a subpool of the threads running in one block, and that this pool is exactly one warp?
__syncthreads() is applied to all the threads of a block, whether they belong to the same warp or not.
A warp is a group of 32 threads that belong to the same threadblock. Threads 0–31 form warp 0, threads 32–63 form warp 1, and so on. Check the programming guide for how thread IDs are distributed among warps when using 2D or 3D threadblocks.
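To illustrate the point: every __syncthreads() is a barrier for all threads of the block, across all of its warps. A minimal sketch of a block-wide shared-memory reduction (the kernel name blockSum and the 256-thread block size are assumptions, not from the original code):

```cuda
// Hypothetical block-wide sum reduction. Each __syncthreads() makes
// ALL threads of the block wait, not just the threads of one warp.
__global__ void blockSum(const int *in, int *out)
{
    __shared__ int s[256];          // one slot per thread; assumes blockDim.x == 256
    unsigned t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                // all 8 warps of the block wait here

    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();            // again: a block-wide barrier, not warp-wide
    }
    if (t == 0)
        out[blockIdx.x] = s[0];     // thread 0 writes the block's total
}
```

If __syncthreads() were only warp-wide, the reads of s[t + stride] above could race with writes from threads in other warps, and results would depend on scheduling; because it is block-wide, every pass of the loop sees fully written shared memory.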
I’ve gone through your code, and the problem is that you accumulate and store each thread’s sum in a float variable. If you change the types of v and z to int, you will get exact results.
This is a consequence of the IEEE 754 standard, not of the hardware or of CUDA. There are many integers that can be stored exactly in int format but not in float. An excellent reference on the intricacies of the floating-point format is David Goldberg’s paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic”; Google turns up many PDF copies.
Sorry for this thread, you are right! I thought I was doing the same kind of summation; I have worked only with doubles (not floats) for the last 20 years and simply forgot that a float can only represent integers exactly up to 2^24…