A list of nominal CUDA instruction throughputs

This presentation:

explains that instruction throughput depends on “Nominal instruction throughput”. It later says that the arithmetic instruction throughput for an integer add, for example, is 4 cycles/warp. Is there a list of nominal throughputs for most or all CUDA instructions?

I found
But it doesn’t list the modulo operation, for example. Also, what about atomic functions?

Table 2 in Section 5.4.1 of the CUDA C Programming Guide (CUDA 6 release candidate) gives the throughput for different categories of instruction on each of the compute capabilities.

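The modulo operation is not listed in that table because the hardware provides no modulo instruction (or division instruction, for that matter). The compiler expands `%` and `/` on integers into a sequence of other instructions, so their cost is the sum of that sequence rather than a single table entry.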
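To illustrate the point about modulo, here is a minimal sketch that computes a remainder two ways when the divisor is a power of two. The kernel name `mod_two_ways`, the helper `run_mod_check`, and the sizes are made up for this example. Because `n` is a runtime value, the `%` line compiles to a multi-instruction sequence, while the `&` line is a single bitwise operation; the two only agree for unsigned values and a power-of-two `n`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: computes in[i] % n with the operator and with a mask.
__global__ void mod_two_ways(const unsigned *in, unsigned *mod_out,
                             unsigned *and_out, unsigned n, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        mod_out[i] = in[i] % n;        // no hardware modulo: expands to several instructions
        and_out[i] = in[i] & (n - 1);  // one bitwise AND; valid only when n is a power of two
    }
}

// Hypothetical helper: returns the number of elements where the two results differ.
int run_mod_check()
{
    const int count = 1024;
    const unsigned n = 32;  // assumption: power-of-two divisor
    unsigned *in, *mod_out, *and_out;
    cudaMallocManaged(&in, count * sizeof(unsigned));
    cudaMallocManaged(&mod_out, count * sizeof(unsigned));
    cudaMallocManaged(&and_out, count * sizeof(unsigned));

    // Fill the input with arbitrary well-spread unsigned values.
    for (int i = 0; i < count; ++i)
        in[i] = 2654435761u * (unsigned)i;

    mod_two_ways<<<(count + 255) / 256, 256>>>(in, mod_out, and_out, n, count);
    cudaDeviceSynchronize();

    int mismatches = 0;
    for (int i = 0; i < count; ++i)
        if (mod_out[i] != and_out[i]) ++mismatches;

    cudaFree(in);
    cudaFree(mod_out);
    cudaFree(and_out);
    return mismatches;
}

int main()
{
    int mismatches = run_mod_check();
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

To see what the `%` actually turns into on a given architecture, you can dump the machine code of the compiled binary with `cuobjdump -sass` and compare the two lines.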

As for atomic functions, the throughput depends on the access pattern (for example, how many threads contend for the same address), so it can be quite variable. It is best to benchmark it on your own hardware with realistic data.
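As a starting point for such a benchmark, the sketch below times `atomicAdd` under two extreme access patterns: every thread updating one address, and every thread updating its own address. The kernel names, the helpers `time_kernel` and `run_atomic_benchmark`, and the launch sizes are assumptions for this example, not a standard harness. Real workloads usually fall somewhere between these two cases, so substitute your own access pattern and data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All threads update the same address: maximum contention.
__global__ void atomic_same(unsigned *counters)
{
    atomicAdd(&counters[0], 1u);
}

// Each thread updates its own address: no contention.
__global__ void atomic_spread(unsigned *counters)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&counters[i], 1u);
}

// Hypothetical helper: runs the kernel once as warm-up, then times a second
// launch with CUDA events. Returns elapsed milliseconds of the timed launch.
float time_kernel(void (*kernel)(unsigned *), unsigned *counters,
                  int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(counters);  // warm-up launch (not timed)
    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(counters);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Hypothetical driver: prints both timings and returns true if the counters
// hold the values expected after two launches of each kernel.
bool run_atomic_benchmark()
{
    const int blocks = 1024, threads = 256;  // assumption: sizes chosen for illustration
    const unsigned total = blocks * threads;
    unsigned *counters;
    cudaMalloc(&counters, total * sizeof(unsigned));

    unsigned first = 0, last = 0;

    cudaMemset(counters, 0, total * sizeof(unsigned));
    float ms_same = time_kernel(atomic_same, counters, blocks, threads);
    cudaMemcpy(&first, counters, sizeof(unsigned), cudaMemcpyDeviceToHost);
    bool ok = (first == 2 * total);  // two launches, one increment per thread each

    cudaMemset(counters, 0, total * sizeof(unsigned));
    float ms_spread = time_kernel(atomic_spread, counters, blocks, threads);
    cudaMemcpy(&first, counters, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaMemcpy(&last, counters + total - 1, sizeof(unsigned), cudaMemcpyDeviceToHost);
    ok = ok && (first == 2) && (last == 2);

    printf("same address:     %f ms\n", ms_same);
    printf("distinct address: %f ms\n", ms_spread);

    cudaFree(counters);
    return ok;
}

int main()
{
    return run_atomic_benchmark() ? 0 : 1;
}
```

A single timed launch is noisy; for numbers you intend to rely on, average over many launches and use counts and address distributions taken from your real data.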