CUDA wrong summation results

please close

This is impossible to diagnose from the fragmentary code snippets shown. If you want help with debugging, I would suggest posting small, self-contained, buildable and runnable code that reproduces the problem, along with the nvcc invocation used to build it.

I am using Nsight. I would like to post the whole code, but it would be very hard to follow. What I posted is pretty much everything that could be related to the problem.
This is my nvcc command:

nvcc --cudart static -Xlinker -lgomp --relocatable-device-code=true -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50 -link -o

Do you have multiple threads writing to the same location in memory?

please close

Yes, they write to the same array, but not to the same element inside the array.

Not related to your question or bug, but you launch one block too many when nElems is a multiple of the block size. It is harmless in your code, since you check the index in the kernel, but it is best to launch only as many blocks as you need.

Change

ElemsCalc<<<(nElems/128)+1,128>>>

to

ElemsCalc<<<((nElems+127)/128),128>>>
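
To see where the two launch expressions differ, here is a minimal host-side sketch of the arithmetic only. It is plain C++ that nvcc also accepts. The helper names blocksNaive, blocksCeil and checkBlockCounts are hypothetical and do not come from the original code; the block size of 128 is the one used in the launch above.

```cpp
#include <cassert>

// Block count used by the original launch: (nElems / blockSize) + 1.
constexpr int blocksNaive(int nElems, int blockSize) {
    return nElems / blockSize + 1;
}

// Ceiling division: the smallest block count whose threads cover nElems.
constexpr int blocksCeil(int nElems, int blockSize) {
    return (nElems + blockSize - 1) / blockSize;
}

// Not a multiple of 128: both expressions give the same block count.
static_assert(blocksNaive(100, 128) == 1, "one block covers 100 elements");
static_assert(blocksCeil(100, 128) == 1, "one block covers 100 elements");

// Exact multiple of 128: the original expression launches one extra block.
static_assert(blocksNaive(256, 128) == 3, "one idle block is launched");
static_assert(blocksCeil(256, 128) == 2, "exactly two blocks are needed");

// Runtime check over a range of sizes: the two expressions differ by
// exactly one block at multiples of blockSize and agree everywhere else.
inline void checkBlockCounts(int blockSize, int maxElems) {
    for (int n = 1; n <= maxElems; ++n) {
        const int diff = blocksNaive(n, blockSize) - blocksCeil(n, blockSize);
        assert(diff == ((n % blockSize == 0) ? 1 : 0));
    }
}
```

With the index check already present in the kernel, the extra block only wastes a launch slot; its threads find their index out of range and do no work.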

I am afraid that your edit will make no difference.

It makes a difference when nElems == 128*k. It is a pretty common off-by-one error.

Yes, you are right, but my number of elements (nElems) would have a 0.0000001% chance of being a multiple of 128.