explains that instruction throughput depends on the “nominal instruction throughput”. It later says that the arithmetic instruction throughput for an integer add, for example, is 4 cycles per warp. Is there a list of nominal throughputs for most or all CUDA instructions?
But it doesn’t include the modulo operation, for example. Also, what about atomic functions?