Overhead from simply including atomic functions, even when they are not used?

I have run into some very strange behavior in one of my CUDA codes. The following is taken from the kernel:

01  if(gcfg->dosave && mediaid==0){
02      uint j,baseaddr=0;
03      j=finddet(&p);
04      if(j){
05          baseaddr=atomicAdd(detectedphoton,1);
06          if(baseaddr<gcfg->maxdetphoton){
07              //write to the global mem buffer
08          }
09      }
10  }

As you can see, on line#01 I test whether the save-data flag is on. If it is, I grab the current buffer write position into baseaddr on line#05, and then write the necessary data to the global memory buffer in the block at line#07.

The strange thing is that even when I set gcfg->dosave to 0, I get a 17% slow-down compared to the code without this block! I can get the original speed back only if I comment out the lines between 04 and 09, or explicitly put “0 &&” in the condition on line#01. It seems the slow-down comes mainly from the atomic function. However, if gcfg->dosave=0, it is not supposed to be executed at all!

Does this make sense to you? Any explanations or workarounds?

(I am using CUDA 3.1 with a GTX 470)


An update: if I comment out line#05, I still get the same slow-down. This indicates that the issue is not specific to atomics, but comes from simply including a block of “expensive” code. Even when it is not executed, it still adds overhead. This still does not make sense to me.

More code may mean using more registers in total, even if it is dead code. Occupancy on the GPU may go down as a result of higher register count.

Thank you for your reply, but I am not entirely sure that the extra registers can make such a big change in performance. According to the nvcc ptxas verbose output, commenting out the code block changes the register count from 56 to 55 (yes, I know that is a lot of registers, but it works for my application). I would be really unlucky if that single register changed the speed by 25% :(

It can. Consider, for example, the possibility that you go from 4 blocks running at a time on a multiprocessor to 3 blocks running at a time.
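To make the 4-versus-3 argument concrete, here is a rough occupancy estimate written as plain C++ (it should also compile as host code in a .cu file). It is only a sketch under stated assumptions: the per-multiprocessor limits (32768 registers, 1536 resident threads, 8 resident blocks) are the commonly quoted Fermi-class figures, the block size of 256 threads is hypothetical since the thread does not give one, and register allocation granularity and shared memory are ignored.

```cpp
// Simplified occupancy estimate; plain C++ that also compiles as CUDA host code.
// ASSUMPTIONS (not from the thread): Fermi-class limits per multiprocessor and
// a hypothetical block size of 256 threads. Register allocation granularity
// and shared memory limits are ignored.

constexpr int kRegsPerSM = 32768;       // assumed register file size
constexpr int kMaxThreadsPerSM = 1536;  // assumed resident thread limit
constexpr int kMaxBlocksPerSM = 8;      // assumed resident block limit

constexpr int minOf(int a, int b) { return a < b ? a : b; }

// Number of blocks that fit on one multiprocessor at the same time.
constexpr int residentBlocks(int regsPerThread, int threadsPerBlock) {
    return minOf(kMaxBlocksPerSM,
                 minOf(kMaxThreadsPerSM / threadsPerBlock,
                       kRegsPerSM / (regsPerThread * threadsPerBlock)));
}

// One extra register can cross a threshold: 32 -> 33 drops 4 blocks to 3.
static_assert(residentBlocks(32, 256) == 4, "32 registers, 256 threads");
static_assert(residentBlocks(33, 256) == 3, "33 registers, 256 threads");

// In this simplified model, 55 and 56 registers give the same block count.
static_assert(residentBlocks(55, 256) == residentBlocks(56, 256),
              "55 vs 56 registers, 256 threads");
```

Under these assumptions a single register only matters when it pushes the kernel across one of these thresholds. Whether 55 versus 56 does so in the real kernel depends on the actual block size and on the hardware's allocation granularity, which this sketch does not model.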

I took two registers out of the rest of the kernel, but I am still seeing this speed reduction. In other words, even when the register counts are the same, including this segment of code causes a 25% slow-down even though it does not execute. Any other possibilities?

How did you rule out that the write to the global mem buffer itself is the problem?

If I comment out line#07, I get the same speed reduction, roughly identical to what I get if I comment out the atomics on line#05, or if I keep both lines. So my conclusion was that the compiler is not able to optimize the code as long as either line 5 or line 7 is present. Even when neither line is executed, something happens that prevents the code from running at the optimal speed.

If you are interested, my code can be checked out from SVN anonymously at:


The kernel code is in mcx/src/mcx_core.cu; the block between lines 102 and 106 is roughly identical to my previous example. Use “make det” to compile the code. The test script is in mcx/example/quicktest/run_qtest.sh (you need to edit it and change mcx to mcx_det).

If you want to browse it online, the kernel unit is here:


Let me know if anyone has any thoughts about this. Thanks.
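One possible workaround follows from the observation that hard-coding “0 &&” in the condition restores the original speed: make the save flag a compile-time constant, so the compiler can drop the block entirely when saving is off. The sketch below shows the idea in plain C++ rather than in the kernel itself. DetectionBuffer, recordDetection and kDoSave are hypothetical names introduced for illustration, and whether this recovers the lost speed in the real kernel is untested.

```cpp
#include <cassert>

// Hypothetical stand-in for the detected-photon buffer discussed above;
// none of these names come from the actual source.
struct DetectionBuffer {
    unsigned int count;     // plays the role of *detectedphoton
    unsigned int capacity;  // plays the role of gcfg->maxdetphoton
    float* data;            // plays the role of the global memory buffer
};

// kDoSave replaces the run-time test of gcfg->dosave. When it is false the
// condition is a compile-time constant, so the compiler is free to remove
// the whole block from that instantiation.
template <bool kDoSave>
void recordDetection(DetectionBuffer& buf, unsigned int detectorId, float value) {
    if (kDoSave && detectorId != 0) {
        assert(buf.data != nullptr);
        unsigned int slot = buf.count++;  // a kernel would use atomicAdd here
        if (slot < buf.capacity) {
            buf.data[slot] = value;       // write to the buffer
        }
    }
}
```

The caller would choose recordDetection<true> or recordDetection<false> once, outside the hot loop. In a CUDA build that would presumably mean templating the kernel on the flag and launching one of the two instantiations from the host.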