Poor atomicCAS performance on shared memory

I am using this trick to emulate an otherwise unsupported atomic operation (saturated addition of 32-bit integers) on an array element held in shared memory. I am running on an RTX 2080 Ti under RHEL 7 with CUDA 10.1. If I compile my program with -arch=compute_60, the runtime is almost as good as with atomicAdd(). However, if I compile with -arch=compute_75, the runtime increases by a factor of ten. I suspect this has something to do with the more flexible control flow (independent thread scheduling) introduced with Volta, but I can’t figure out what. Is there a better way of doing this now?
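For readers without the linked code, here is a minimal sketch of the kind of atomicCAS loop being described. It is not the poster's actual kernel: the helper name atomicSaturatedAdd, the demo kernel, and the launch configuration are hypothetical and chosen only for illustration. The loop rereads the value, computes the clamped sum, and retries until atomicCAS confirms that no other thread changed the element in between.

```cuda
#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): atomically performs
// *addr = clamp(*addr + val, INT_MIN, INT_MAX) and returns the previous value.
__device__ int atomicSaturatedAdd(int *addr, int val)
{
    int old = *addr;
    int assumed;
    do {
        assumed = old;
        // Do the addition in 64 bits so the overflow can be detected and clamped.
        long long sum = (long long)assumed + (long long)val;
        if (sum > INT_MAX) sum = INT_MAX;
        if (sum < INT_MIN) sum = INT_MIN;
        // The swap succeeds only if *addr still holds 'assumed'; otherwise retry
        // with the value that atomicCAS returned.
        old = atomicCAS(addr, assumed, (int)sum);
    } while (old != assumed);
    return old;
}

// Hypothetical demo kernel: every thread in the block adds 1 to a single
// shared-memory counter that starts 100 below INT_MAX.
__global__ void demo(int *out)
{
    __shared__ int counter;
    if (threadIdx.x == 0) counter = INT_MAX - 100;
    __syncthreads();

    atomicSaturatedAdd(&counter, 1);
    __syncthreads();

    if (threadIdx.x == 0) *out = counter;
}

int main()
{
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));

    demo<<<1, 256>>>(d_out);  // 256 increments, more than the 100 of headroom
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d (INT_MAX is %d)\n", h_out, INT_MAX);
    cudaFree(d_out);
    return 0;
}
```

With 256 threads each adding 1 to a counter that starts at INT_MAX - 100, the result should clamp at INT_MAX instead of wrapping around. All threads hitting the same shared-memory element is the worst case for contention, so the retry loop in this sketch is the part whose cost would be expected to differ between builds.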

Have you compared the SASS assembly code to see what the difference in code might be?

Also dump the PTX code with the --keep compiler option while you’re at it ;-)

As far as I can see, the PTX is not affected by -arch (except for the ‘target’ field). I have not yet compared the SASS assembly. I guess I need to figure out how to do that.