What's the difference between special registers and general registers?

I have been reading the PTX ISA 9.0 documentation for a while, and I found that it includes many special registers, such as `%tid`.

So I wonder whether SRs and GPRs share the same register file. Can anyone tell me where the `SR`s are located?

A general register is part of the register file for the thread. It is readable and writable, and useable for any purpose at the discretion of the compiler. A general register has no predefined meaning or definition as to what it contains.

A special register is not part of the register file of the thread. It does not contribute to the register usage (as reported by -Xptxas=-v) or register footprint of the thread. The ones that I can think of offhand are not writable. Special registers usually serve a specific purpose, for example to provide access to a clock or to provide access to fixed data such as the built-in thread variables, like thread ID and CTA ID.

Thanks for your reply. Since the `SR` is not part of the register file, where is it located in the hardware: the constant cache, the texture cache, or some independent cache?

For most, if not all, special registers caching makes no sense, because they are either fixed or not used for data exchange.

They typically are read through the variable latency instruction MIO interface, similar to shared memory, but without using the shared memory infrastructure.

The reason is probably (this is my headcanon explanation) that this path easily allows indexed access, and the connected register is less hard-coded than the general-purpose register ports. The lost bandwidth/latency is not critical, as special registers are seldom used.

If you describe what you are specifically interested in (general theoretical working, latency, concurrency, usefulness, …), we (people on this forum) could answer more directly (as far as published by Nvidia).

I want to know more about how the `SR` fits into the memory hierarchy.

To a first order approximation I would expect it to be part of the SM hardware. I don’t have any information beyond that.

The primary way I know of to access a special register is via the S2R SASS instruction. (If you have a particular special register of interest, you should be able to code up inline PTX, then dump the SASS and see what it compiles to.) I would call this a register-to-register move. With respect to the memory hierarchy, I believe it belongs to the register level of the hierarchy (i.e. not cache, not shared, not global), but as already indicated above it may have access similarities to other aspects of the memory hierarchy. I don’t know that all this is documented anywhere, but you may find references to it or descriptions of it in various forum posts.
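As a rough illustration of the "inline PTX, then dump the SASS" approach, here is a minimal sketch. The kernel name `read_sregs`, the file name `sregs.cu`, and the choice of `%laneid`, `%smid`, and `%clock` are my own picks for illustration, not anything prescribed by the PTX documentation. Which SASS instructions you actually get will depend on the architecture and toolkit version.

```cuda
// Hypothetical file name: sregs.cu
// Build:      nvcc -o sregs sregs.cu
// Dump SASS:  cuobjdump -sass sregs
#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies three PTX special registers into general registers
// via inline PTX, then stores them so the reads are not optimized away.
__global__ void read_sregs(unsigned *lane, unsigned *sm, unsigned *clk)
{
    unsigned l, s, c;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(l));
    asm volatile("mov.u32 %0, %%smid;"   : "=r"(s));
    asm volatile("mov.u32 %0, %%clock;"  : "=r"(c));

    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    lane[i] = l;
    sm[i]   = s;
    clk[i]  = c;
}

int main()
{
    const int n = 64;  // two warps in a single block
    unsigned *lane = nullptr, *sm = nullptr, *clk = nullptr;
    if (cudaMallocManaged(&lane, n * sizeof(unsigned)) != cudaSuccess ||
        cudaMallocManaged(&sm,   n * sizeof(unsigned)) != cudaSuccess ||
        cudaMallocManaged(&clk,  n * sizeof(unsigned)) != cudaSuccess)
        return 1;

    read_sregs<<<1, n>>>(lane, sm, clk);
    if (cudaDeviceSynchronize() != cudaSuccess)
        return 1;

    // Print a few sample threads; values other than laneid vary per run.
    for (int i = 0; i < n; i += 16)
        printf("thread %2d: laneid=%2u smid=%u clock=%u\n",
               i, lane[i], sm[i], clk[i]);

    cudaFree(lane);
    cudaFree(sm);
    cudaFree(clk);
    return 0;
}
```

In the `cuobjdump -sass` output, look for the instructions that feed the three stores. I would expect to see S2R (or a related variant) with an `SR_` operand for each read, but treat that as something to verify on your own GPU rather than a guarantee.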

There are PTX special registers and there are hardware special registers. PTX special registers are named registers that can be accessed in PTX that assemble down to a sequence of SM instructions. In most cases these optimize to one of the following:

  1. A read from constant bank 0 (e.g. gridDim (%nctaid), blockDim (%ntid)), which contains per-launch constants.
  2. A move from a special register to a per-thread register (S2R R0, SR_ClockLo; CS2R.64 R0, SR_Clock) or to a warp-uniform register (S2UR UR4, SR_ClockLo).

For some PTX special registers additional instructions are required to retrieve a field in the read constant or read special register.
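To see the constant-bank case for yourself, a small sketch like the following can help. It reads `%ntid.x` and `%nctaid.x` through inline PTX and checks them against `blockDim.x` and `gridDim.x`. The kernel name `check_dims` and the launch configuration are assumptions made purely for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads the launch dimensions as PTX special registers and compares them
// with the CUDA built-in variables, counting any disagreement.
__global__ void check_dims(int *mismatch, unsigned *seen)
{
    unsigned ntid, nctaid;
    asm volatile("mov.u32 %0, %%ntid.x;"   : "=r"(ntid));
    asm volatile("mov.u32 %0, %%nctaid.x;" : "=r"(nctaid));

    if (ntid != blockDim.x || nctaid != gridDim.x)
        atomicAdd(mismatch, 1);

    // One thread records the values so the host can print them.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        seen[0] = ntid;
        seen[1] = nctaid;
    }
}

int main()
{
    int *mismatch = nullptr;
    unsigned *seen = nullptr;
    if (cudaMallocManaged(&mismatch, sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&seen, 2 * sizeof(unsigned)) != cudaSuccess)
        return 1;
    *mismatch = 0;
    seen[0] = seen[1] = 0;

    check_dims<<<4, 128>>>(mismatch, seen);  // arbitrary launch shape
    if (cudaDeviceSynchronize() != cudaSuccess)
        return 1;

    printf("%%ntid.x=%u %%nctaid.x=%u mismatches=%d\n",
           seen[0], seen[1], *mismatch);

    int failed = (*mismatch != 0);
    cudaFree(mismatch);
    cudaFree(seen);
    return failed;
}
```

If you dump the SASS for this kernel, the two reads should show up as constant-bank accesses rather than S2R, in line with the description above. The exact operand form differs between architectures and toolkit versions, so check it on your target.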

In the case of SR_<special_register> there are two common paths:

  1. The value is constant across all threads and available at the instruction pipe front end (e.g. SR_Clock).
  2. The value varies per thread (e.g. %laneid, %tid), in which case the S2R instruction uses the shared memory crossbar to access a separate bank of special registers. These accesses show up in Nsight Compute metrics in the Shared Memory table. This is referenced in the Nsight Compute Kernel Profiling Guide (LSU pipeline description) and in counter descriptions such as the Shared Memory table "Other" row tooltip (“Shared memory traffic generated by other instructions, including S2R, SHFL, and fully predicated off shared memory accesses.”).

LDC, LDCU, S2R and S2UR are variable latency instructions. If the pipeline is stalled the dependent instruction will report a short scoreboard stall. CS2R is a fixed latency instruction useful for reading the wall clock timestamp (%globaltimer) and cycle count (%clock). CS2R.64 reads both the lower and upper 32-bit value in the same cycle.
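For the cycle counter specifically, a minimal timing sketch might look like the one below. It uses the CUDA built-in `clock64()`, which corresponds to the PTX `%clock64` special register. The kernel name `time_loop`, the loop body, and the iteration count are arbitrary choices of mine. Whether the reads compile to CS2R.64 on your GPU is something to confirm in the SASS.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Brackets a short dependent arithmetic loop with two cycle-counter reads.
// The result is stored to 'sink' so the loop is not removed entirely.
__global__ void time_loop(long long *cycles, float *sink)
{
    long long t0 = clock64();

    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0001f + 0.5f;

    long long t1 = clock64();

    cycles[threadIdx.x] = t1 - t0;
    sink[threadIdx.x]   = x;
}

int main()
{
    const int n = 32;  // a single warp
    long long *cycles = nullptr;
    float *sink = nullptr;
    if (cudaMallocManaged(&cycles, n * sizeof(long long)) != cudaSuccess ||
        cudaMallocManaged(&sink,   n * sizeof(float))     != cudaSuccess)
        return 1;

    time_loop<<<1, n>>>(cycles, sink);
    if (cudaDeviceSynchronize() != cudaSuccess)
        return 1;

    printf("thread 0 measured %lld cycles\n", cycles[0]);

    cudaFree(cycles);
    cudaFree(sink);
    return 0;
}
```

One caveat: the compiler is free to schedule instructions around the two counter reads, so the measured interval is only approximate. Inspecting the SASS shows where the reads actually ended up relative to the loop.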