This NVIDIA webinar:

http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_Further_CUDA_Optimization.pdf

explains that instruction throughput depends on the “Nominal instruction throughput”. It later says, for example, that the arithmetic instruction throughput for an integer add is 4 cycles per warp. Is there a list of nominal throughputs for most or all CUDA instructions?

The CUDA C Programming Guide has a table of arithmetic instruction throughputs:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

However, it doesn't cover the modulo operation, for example. What about atomic functions?
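To make the question concrete, here is a rough sketch of how the cost of these two operations could be estimated empirically when no documented figure is available. It is only a sketch under several assumptions: the kernel names, `kIters`, the seed, and the divisor are made up for illustration, a single warp is launched, and `clock64()` is read around a loop. The modulo loop is a dependent chain that also contains one integer add per iteration, so the number it prints is closer to latency than to peak throughput. The atomic loop has all 32 lanes hit the same address, and the compiler may merge or reorder those atomics, so its result should be read with similar caution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kIters = 4096;  // loop length; an arbitrary choice for this sketch

// Times a chain of dependent integer modulo operations executed by one warp.
// The divisor is a kernel argument so the compiler cannot fold it to a constant.
__global__ void time_modulo(unsigned seed, unsigned divisor,
                            long long *cycles, unsigned *sink) {
    unsigned x = seed + threadIdx.x;
    long long start = clock64();
    #pragma unroll 16
    for (int i = 0; i < kIters; ++i) {
        x = x % divisor + i;  // "+ i" keeps the value changing; adds one integer add
    }
    long long stop = clock64();
    sink[threadIdx.x] = x;    // store the result so the loop is not optimized away
    if (threadIdx.x == 0) *cycles = stop - start;
}

// Times repeated atomicAdd calls where every lane targets the same address.
__global__ void time_atomic_add(long long *cycles, unsigned *counter) {
    long long start = clock64();
    for (int i = 0; i < kIters; ++i) {
        atomicAdd(counter, 1u);
    }
    long long stop = clock64();
    if (threadIdx.x == 0) *cycles = stop - start;
}

// Copies the cycle count back and returns cycles per loop iteration,
// or -1.0 if the kernel or the copy failed.
static double cycles_per_iter(const long long *d_cycles) {
    long long h_cycles = 0;
    cudaError_t err = cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return static_cast<double>(h_cycles) / kIters;
}

double run_modulo_benchmark() {
    long long *d_cycles = nullptr;
    unsigned *d_sink = nullptr;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, 32 * sizeof(unsigned));
    time_modulo<<<1, 32>>>(12345u, 97u, d_cycles, d_sink);  // one warp
    double result = cycles_per_iter(d_cycles);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    return result;
}

double run_atomic_benchmark() {
    long long *d_cycles = nullptr;
    unsigned *d_counter = nullptr;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_counter, sizeof(unsigned));
    cudaMemset(d_counter, 0, sizeof(unsigned));
    time_atomic_add<<<1, 32>>>(d_cycles, d_counter);        // one warp
    double result = cycles_per_iter(d_cycles);
    cudaFree(d_cycles);
    cudaFree(d_counter);
    return result;
}

int main() {
    printf("modulo (+ add): %.2f cycles per iteration\n", run_modulo_benchmark());
    printf("atomicAdd:      %.2f cycles per iteration\n", run_atomic_benchmark());
    return 0;
}
```

This should build with something like `nvcc -O3 bench.cu -o bench` (the file name is arbitrary). The numbers will differ between GPU architectures and compiler versions, so they are no substitute for a documented table, which is what I am really looking for.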