How costly is the S2R instruction (reading a special register)?

It’s been suggested to me that reading the value of a special register carries a penalty (compared to using a regular register). Is that true? Is it more costly to do

S2R R0, SR_LANEID;

than, say,

MOV R1, R0;

?

I looked at the operation throughput table in the CUDA programming guide, but it doesn’t mention special registers at all.
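
For context, one way such an S2R arises from source code is a lane-ID read through inline PTX, roughly like the sketch below (the kernel name is just for illustration; the exact SASS depends on the architecture and toolkit version):

// Minimal sketch: the %laneid read below is typically emitted as
// S2R Rx, SR_LANEID in the SASS (exact codegen depends on arch/toolkit).
__global__ void read_laneid(unsigned int *out)
{
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[threadIdx.x] = lane;
}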

I am not aware of any NVIDIA documentation that provides information on that. You may want to set up a microbenchmark and measure it in case it matters. I have honestly never encountered a special register read as something that impacts application-level performance, whether in my own code or in customer code. I am curious how your code uses special registers such that a performance difference becomes noticeable.

Based on general knowledge of processor design, I consider it plausible and even probable that access to a special register is slower than access to a GPR (general-purpose register). Performance may also differ between various kinds of special registers, so when you microbenchmark, you might want to look at multiple SRs.
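
Something along the lines of the sketch below is what I have in mind; treat it as a rough starting point only. The clock() timestamps around an inline-PTX read are my assumption for how to structure it; you would want to check with cuobjdump -sass that the S2R really lands between the two clock reads, launch plenty of threads, and look at the minimum of the reported values. Swapping %laneid for other SRs (%smid, %warpid, ...) lets you compare different special registers.

// Rough microbenchmark sketch (not a definitive methodology): time a single
// special-register read between two clock() reads.
__global__ void s2r_timing(unsigned int *elapsed, unsigned int *sink)
{
    unsigned int lane;
    unsigned int start = (unsigned int)clock();
    // Special-register read under test; replace %laneid with another SR
    // (e.g. %smid, %warpid) to compare different special registers.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    unsigned int stop = (unsigned int)clock();
    // Store both results so neither the S2R nor the clock reads are eliminated.
    sink[threadIdx.x] = lane;
    elapsed[threadIdx.x] = stop - start;
}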

Just did some benchmarks; it looks like S2R takes around 22 cycles to complete (some SRs are slower than others, but only by 1-2 cycles). So yes, the MOV would be more efficient.

And how many cycles does MOV require on your GPU?

Almost every fixed-latency instruction (MOV included) takes 6 cycles (the pipeline depth in Pascal) to get the answer back.
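
If you want to sanity-check that number yourself, one rough way (a sketch under my own assumptions about what the compiler emits, not a definitive method) is to time a long chain of dependent simple integer operations, so that nothing can be overlapped, and divide by the number of instructions in the chain:

// Rough sketch: each iteration is a shift followed by an xor that consumes the
// previous result, so the chain cannot be pipelined. With two dependent
// fixed-latency instructions per iteration, (stop - start) / (2 * N) gives an
// estimate of per-instruction latency. Verify the actual instruction count
// with cuobjdump -sass before trusting the numbers.
__global__ void fixed_latency_chain(unsigned int *out, unsigned int seed)
{
    const int N = 256;
    unsigned int x = seed;             // runtime value, so the chain cannot be folded
    unsigned int start = (unsigned int)clock();
    #pragma unroll
    for (int i = 0; i < N; ++i)
        x ^= (x >> 1);                 // depends on the previous iteration
    unsigned int stop = (unsigned int)clock();
    out[0] = x;                        // keep the chain live
    out[1] = (stop - start) / (2 * N); // rough per-instruction latency estimate
}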

CS2R instructions have fixed latency.
S2R instructions have variable latency. The latency depends on the SR being read and on the utilization of the other variable-latency execution paths such as shared memory and texture. The latency should be in the low 20s of cycles but can grow into the 200s if texture and shared memory have high utilization.
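
If you want to see which of the two forms the compiler picks for a given read, one way (just a sketch; exact code generation depends on the GPU architecture and toolkit version) is to put both a clock() read and an inline-PTX %laneid read in a kernel and look at the SASS with cuobjdump -sass. On recent architectures the clock read typically shows up as CS2R, while the lane-ID read shows up as S2R.

// Sketch for inspecting the generated SASS with cuobjdump -sass. The comments
// describe typical codegen, not a guarantee.
__global__ void which_instruction(unsigned int *out)
{
    unsigned int t = (unsigned int)clock();                  // usually CS2R Rx, SR_CLOCKLO
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));      // usually S2R Rx, SR_LANEID
    out[threadIdx.x] = t ^ lane;                             // keep both reads live
}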