atomicAdd on global memory usually compiles to a RED instruction in SASS, but sometimes ATOM is generated instead. I’m trying to build a small test case to post here, but I haven’t managed to put one together yet.
According to a related post, ATOM is generated when the return value is used, but I see ATOM generated even when the return value is not used.
Related post: Difference between RED and ATOMG sass instruction
Does anyone know the conditions under which ATOM is generated?
ATOM is used when the addresses are generic addresses. The compiler doesn’t know whether every thread’s address is global or shared; threads can hold a mix of global, local, or shared memory addresses.
ATOMG is used when the address is known to be in global memory.
ATOMS is used when the address is known to be a shared memory offset.
RED does not support generic addresses that point into the local or shared memory window. In that case the compiler will emit an ATOM instruction instead.
The CUDA ABI states that addresses passed to a device function have to be generic addresses. Inside the device function the compiler will often check whether all addresses are to global or shared memory and use the more optimal instruction.
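A minimal sketch of the three cases described above, in case it helps when building a reproducer. All names here (add_no_return, add_with_return, bump, add_generic, run_all) are hypothetical and not taken from the original code. Which SASS instruction each atomicAdd actually becomes depends on the nvcc version, the target architecture, and inlining decisions, so the comments state what I would expect to look for, not a guaranteed result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Case 1: result discarded, pointer comes straight from a kernel parameter.
// Expectation (not guaranteed): a RED-style reduction.
__global__ void add_no_return(int *counter) {
    atomicAdd(counter, 1);
}

// Case 2: the old value is consumed, so the instruction must return data.
// Expectation (not guaranteed): an ATOM-family instruction.
__global__ void add_with_return(int *counter, int *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = atomicAdd(counter, 1);
}

// Case 3: hypothetical helper. __noinline__ asks the compiler to keep the call,
// so the pointer reaches the function as a generic address.
__device__ __noinline__ void bump(int *p) {
    atomicAdd(p, 1);  // result discarded, but the address space is unknown here
}

__global__ void add_generic(int *counter) {
    __shared__ int block_count;
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();
    bump(&block_count);  // generic address into the shared window
    bump(counter);       // generic address into global memory
    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(counter, block_count);
}

// Runs all three kernels with one block of 32 threads and returns the counter.
int run_all() {
    const int threads = 32;
    int *counter = nullptr, *out = nullptr;
    cudaMalloc(&counter, sizeof(int));
    cudaMalloc(&out, threads * sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    add_no_return<<<1, threads>>>(counter);
    add_with_return<<<1, threads>>>(counter, out);
    add_generic<<<1, threads>>>(counter);

    int host_counter = -1;
    cudaMemcpy(&host_counter, counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(counter);
    cudaFree(out);
    return host_counter;
}

int main() {
    printf("counter = %d\n", run_all());
    return 0;
}
```

To compare the generated instructions, compile to a cubin for the architecture of interest and dump the SASS with `cuobjdump -sass`, then search each kernel for RED, ATOM, ATOMG, and ATOMS.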
If you post a reproducer, please include the nvcc command line, the nvcc version, and, if you are specifying PTX compilation (vs. SASS), the GPU you executed on.
I didn’t think you could do an atomic on a local window address.
I think so, too.
Also, the Nsight Compute profile shows no local loads or stores, so there is no register spilling. (Of course, there are no shared loads or stores either.)
Anyway, I will prepare a reproducer. For now, however, whenever I try to minimize the code, RED instructions are generated instead…