The atomic functions do not provide correct results

I tried to test the atomic functions using the testKernel routine.
The project file is simpleAtomicIntrinsics_vs2019.vcxproj from NVIDIA Corporation\CUDA Samples\v11.0\0_Simple\simpleAtomicIntrinsics.

I found that the results do not match the definition of the atomic functions. Here are the results in Visual Studio at the beginning and at the end of the routine.

The application computes whether the results are correct in the computeGold routine. So I would assume that if computeGold is not reporting an error, things are working correctly.
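As a rough illustration of that kind of host-side check, here is a minimal sketch. It is not the sample's code: `addSubKernel` and `runAddSub` are hypothetical names, and it assumes the 64-block, 256-thread launch that appears in the debugger output later in this thread, with both counters starting at 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the sample's testKernel): every thread adds 10
// to slot 0 and subtracts 10 from slot 1.
__global__ void addSubKernel(int *g_odata)
{
    atomicAdd(&g_odata[0], 10);
    atomicSub(&g_odata[1], 10);
}

// Launches the kernel over blocks * threads threads and copies both
// counters back to the host. Returns false if any CUDA call fails.
bool runAddSub(int blocks, int threads, int result[2])
{
    int *d_odata = nullptr;
    if (cudaMalloc(&d_odata, 2 * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_odata, 0, 2 * sizeof(int));   // both counters start at 0

    addSubKernel<<<blocks, threads>>>(d_odata);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(result, d_odata, 2 * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_odata);
    return ok;
}

int main()
{
    const int blocks = 64, threads = 256;      // assumed launch shape
    int result[2];
    if (!runAddSub(blocks, threads, result)) {
        printf("CUDA error\n");
        return 1;
    }

    // Host reference: one +10 and one -10 per launched thread.
    const int expectedAdd = 10 * blocks * threads;
    const int expectedSub = -10 * blocks * threads;
    printf("g_odata[0] = %d (expected %d)\n", result[0], expectedAdd);
    printf("g_odata[1] = %d (expected %d)\n", result[1], expectedSub);
    return (result[0] == expectedAdd && result[1] == expectedSub) ? 0 : 1;
}
```

The check only looks at the values after the whole grid has finished, which is why it says nothing about what a debugger shows partway through the kernel.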

My guess is that you have set breakpoints, or otherwise used the Visual Studio debugging interface, in such a way that the results you are looking at are those after only a single warp has executed.

I would also encourage you, where possible, to post text as formatted text rather than as images when asking for help here.

Hi Robert_Crovella, thanks for your reply.
The routine I am testing is testKernel, not computeGold.
From the definition of atomicAdd, I expect g_odata[0] to be 10 after atomicAdd(&g_odata[0], 10), because the old value of g_odata[0] is 0.
Likewise, I expect g_odata[1] to be -10 after atomicSub(&g_odata[1], 10), because the old value of g_odata[1] is 0.
That is not what I see, and that is the problem.

Perhaps you should study the whole sample code, to understand how it works, rather than just the kernel.

That would be true if only one thread were running. But you have multiple threads running in parallel, and in particular you have threads in a warp executing in lockstep. What you are observing is the result after 32 threads have completed the work, specifically the first warp. You’ll need to understand how a GPU executes code. The debugger does not isolate a single thread for you. When you allow a thread to execute this line:

    atomicAdd(&g_odata[0], 10);

at a minimum, it will not be one single thread executing that line of code, it will be all the active threads in the warp.
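To make the warp-level behavior concrete, here is a minimal sketch that launches exactly one warp. It assumes a warp size of 32 and a counter that starts at 0; `recordOldValues` and `runOneWarp` are hypothetical names, not part of the sample. Each thread records the old value that atomicAdd returned to it, so you can see that every thread gets a different old value even though they all execute the same line.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds 10 to the counter and stores the
// old value that atomicAdd returned to that thread.
__global__ void recordOldValues(int *counter, int *oldValues)
{
    oldValues[threadIdx.x] = atomicAdd(counter, 10);
}

// Runs one block of `threads` threads. Fills oldValues (sorted ascending,
// since the order in which threads hit the atomic is unspecified) and
// returns the final counter value, or -1 on a CUDA error.
int runOneWarp(int threads, int *oldValues)
{
    int *d_counter = nullptr, *d_old = nullptr;
    int finalValue = -1;
    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_old, threads * sizeof(int)) != cudaSuccess) {
        cudaFree(d_counter);
        return -1;
    }
    cudaMemset(d_counter, 0, sizeof(int));     // counter starts at 0

    recordOldValues<<<1, threads>>>(d_counter, d_old);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(&finalValue, d_counter, sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess &&
              cudaMemcpy(oldValues, d_old, threads * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_counter);
    cudaFree(d_old);
    if (!ok) return -1;

    std::sort(oldValues, oldValues + threads);
    return finalValue;
}

int main()
{
    const int kThreads = 32;                   // one warp (assumed size 32)
    int oldValues[kThreads];
    int finalValue = runOneWarp(kThreads, oldValues);
    if (finalValue < 0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("final counter = %d\n", finalValue);            // 32 * 10 = 320
    printf("old values range from %d to %d\n",
           oldValues[0], oldValues[kThreads - 1]);          // 0 to 310
    return 0;
}
```

Only one of the 32 threads sees the old value 0. By the time the whole warp has executed the line, the counter holds 320, which is the kind of value a debugger would show after stepping a full warp over it.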

If you would like to see the behavior of just a single thread, in isolation, one way to do that would be to modify that kernel launch, so that only one thread is executing.
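For example, a single-thread launch could look like the sketch below. It is a simplified stand-in for the sample's kernel, not a copy of it: `singleThreadKernel` and `runSingleThread` are hypothetical names, and the buffer is assumed to start at 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: performs one atomicAdd and one atomicSub and keeps
// the old values that the two calls return.
__global__ void singleThreadKernel(int *g_odata, int *returned)
{
    returned[0] = atomicAdd(&g_odata[0], 10);
    returned[1] = atomicSub(&g_odata[1], 10);
}

// Launches the kernel with exactly one thread (<<<1, 1>>>).
// Returns false if any CUDA call fails.
bool runSingleThread(int g_odata[2], int returned[2])
{
    int *d_buf = nullptr;                      // [0..1] data, [2..3] returned
    int h_buf[4];
    if (cudaMalloc(&d_buf, 4 * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_buf, 0, 4 * sizeof(int));     // old values are 0

    singleThreadKernel<<<1, 1>>>(d_buf, d_buf + 2);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(h_buf, d_buf, 4 * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_buf);
    if (!ok) return false;

    g_odata[0] = h_buf[0];  g_odata[1] = h_buf[1];
    returned[0] = h_buf[2]; returned[1] = h_buf[3];
    return true;
}

int main()
{
    int g_odata[2], returned[2];
    if (!runSingleThread(g_odata, returned)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("g_odata[0] = %d, atomicAdd returned %d\n", g_odata[0], returned[0]); // 10, 0
    printf("g_odata[1] = %d, atomicSub returned %d\n", g_odata[1], returned[1]); // -10, 0
    return 0;
}
```

With only one thread in flight, the values match the single-thread expectation of 10 and -10.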

I looked at the whole project and found that the computeGold routine checks the results from testKernel. computeGold reports no error, which means the results from testKernel are correct.
I then watched the variables in the Locals window to see what happens after calling atomicAdd.

- [Launch Details] {…}
@flatBlockIdx 0 uint64_t
@flatThreadIdx 0 ulong
+ blockIdx { x=0 y=0 z=0 } uint3
+ threadIdx { x=0 y=0 z=0 } uint3
+ gridDim { x=64 y=1 z=1 } dim3
+ blockDim { x=256 y=1 z=1 } dim3

I found that the thread information did not change, but the value of g_odata[0] changed to 320. If the displayed information is not what one expects, it will be difficult to debug the code.
Your explanation of what happens when the line atomicAdd(&g_odata[0], 10) executes is very good.
The result shown is the one after all active threads in the warp have executed it.
Thank you very much for the explanation.