This is impossible to diagnose from the fragmentary code snippets shown. If you desire help with debugging, I would suggest posting small, self-contained, buildable and runnable code, that reproduces the problem, along with information on the nvcc invocation used to build the code.
I am using Nsight. I would like to post the whole code, but it would be very hard to follow. What I posted is pretty much everything that could be related to the problem.
And this is my nvcc command:
nvcc --cudart static -Xlinker -lgomp --relocatable-device-code=true -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50 -link -o
Do you have multiple threads writing to the same location in memory?
Yes, they write to the same array, but not to the same element inside the array.
Not related to your question or bug, but you launch too many blocks when nElems is a multiple of the block size. It is harmless in your code since you check the index in the kernel, but it is best to launch only as many blocks as you need.
I am afraid that your edit will make no difference.
It makes a difference when nElems == 128*k. It is a pretty common off-by-one error.
Yeah, you are right, but my number of elements (nElems) would have a 0.0000001% chance of being a multiple of 128.
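For reference, the off-by-one discussed above can be sketched with plain host-side arithmetic. The thread does not show the original launch code, so the "always add one" form below is only an assumption about what it looked like, and both function names are hypothetical. The code builds with nvcc or with any C++ compiler.

```cpp
#include <cassert>

// Assumed form of the original grid-size calculation (not shown in the thread):
// it always adds one block, so an extra, idle block is launched whenever
// nElems is an exact multiple of blockSize.
int blocksAlwaysPlusOne(int nElems, int blockSize) {
    assert(blockSize > 0);
    return nElems / blockSize + 1;
}

// Ceiling division: the smallest number of blocks whose threads cover nElems.
int blocksCeilDiv(int nElems, int blockSize) {
    assert(blockSize > 0);
    return (nElems + blockSize - 1) / blockSize;
}
```

With a block size of 128, the two versions agree for nElems = 257 (3 blocks each) and differ only for exact multiples such as nElems = 256 (3 blocks versus 2). The in-kernel index check is still needed with the ceiling-division form, because the last block is usually only partly filled.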