Understanding uniform registers

As mentioned in the following topics, I also encountered nvcc compiling my kernels to use “uniform registers”. From the paper referenced in the first discussion, I understand this optimization adds a separate datapath alongside the main datapaths to increase arithmetic throughput, with uniform registers holding data that is identical across all threads of a warp.
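
For reference, here is a minimal sketch of the kind of kernel where this shows up for me (the exact SASS varies with architecture and compiler version, so the comments are illustrative rather than definitive):

```
// Compile and disassemble, e.g.:
//   nvcc -arch=sm_86 -cubin uniform_demo.cu -o uniform_demo.cubin
//   cuobjdump -sass uniform_demo.cubin
__global__ void uniform_demo(float* out, const float* in, int n)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    // n and stride are identical for every thread of the warp, so the
    // compiler may keep them in uniform registers (UR#) and emit uniform
    // instructions (ULDC, UMOV, UIADD3, ...) for parts of the loop
    // bookkeeping, next to the regular per-thread instructions.
    for (int i = idx; i < n; i += stride)
        out[i] = in[i] * 2.0f + 1.0f;
}
```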

While I can understand how my kernel design is being optimized by routing arithmetic instructions through a separate datapath (even before all available regular registers might be utilized), additional clarification on the following would help me better understand this optimization:

  • Does the Ampere tuning guide’s stated maximum of 255 registers per thread include or exclude uniform registers?
  • Can I explicitly control when the compiler uses uniform registers, i.e. can I prevent it from using them?
  • If not, what explicit kernel characteristics trigger the compiler to potentially use the separate datapath for integer-only work (and thus uniform registers)? E.g. loops of size N, M identical instructions inside an (unrolled) loop, …?

This understanding would enable me to make more informed statements about the performance estimates I’m deriving from my microbenchmarking results.

Thank you for any insights or additional references on this topic.

I don’t think there is any documentation to fully answer these questions.

Up to 255 registers can be used, even without making use of the uniform datapath. The tuning guide is referring to “ordinary” register usage. AFAIK, NVIDIA does not publish how many uniform registers are available or what UR limits the compiler adheres to.

No, the NVIDIA toolchain provides no controls to modify Uniform datapath usage.
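
The knobs that do exist only touch ordinary registers. A small illustration (my own sketch, using the documented --maxrregcount option and __launch_bounds__ qualifier):

```
// Ordinary per-thread register usage can be capped globally, e.g.
//   nvcc --maxrregcount=64 kernel.cu
// or per kernel via __launch_bounds__:
__global__ void __launch_bounds__(256, 4) capped_kernel(float* data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}
// Neither mechanism has any documented effect on uniform registers.
```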

I don’t think there is any documentation to fully answer this question either. As indicated in the places you linked, the compiler may use the uniform datapath when there is a mix of integer and floating-point instructions that can be scheduled more effectively once some of the integer work is moved off the main datapath.

You can always file a bug to request changes to CUDA documentation.

For some additional speculation on my part, as suggested here (slide 10), the uniform datapath seems to be usable when the compiler can detect that the data used in a computation is warp-uniform. The mention of “reverse vectorization” there suggests to me that only a single hardware unit (a “scalar unit”) performs the computation (since the data is warp-uniform) and then supplies the result back to all threads in the warp via the UR system. Thus, for a small hardware cost (roughly 1/32 of the normal ALU capacity, from a chip real-estate perspective), instructions in the uniform path can still be scheduled. By pushing such activity to the uniform datapath, the compiler can free up scheduling slots in the main datapath.
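
To make “warp-uniform” concrete (my own illustration, not from the slides): a value is warp-uniform when every active lane of the warp provably holds the same bits, which is exactly what would allow a single scalar unit to compute it once for the whole warp. The property can be demonstrated at runtime with the __match_all_sync intrinsic (compute capability 7.0+), even though the compiler has to prove it statically:

```
#include <cstdio>

__global__ void uniformity_demo(int n)
{
    // Warp-uniform: derived only from a kernel parameter, so all 32
    // lanes compute the same value.
    int tiles = (n + 31) / 32;

    // Not warp-uniform: depends on the lane index.
    int lane_val = threadIdx.x * tiles;

    int pred;
    __match_all_sync(0xffffffffu, tiles, &pred);     // pred == 1
    if (threadIdx.x == 0) printf("tiles uniform:    %d\n", pred);
    __match_all_sync(0xffffffffu, lane_val, &pred);  // pred == 0
    if (threadIdx.x == 0) printf("lane_val uniform: %d\n", pred);
}
```

Launched as uniformity_demo<<<1, 32>>>(1000), this should print 1 and then 0.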

This is probably also valuable because some functional units serve multiple instruction classes, e.g. the FMA path on some architectures can deliver either integer or floating-point throughput. Scheduling integer ops on a separate path (when possible) frees those functional units up for floating-point ops more often.

Unofficially, it seems that 255 is the total limit, regardless of register type. See page 33 of the paper referenced in your first link.

On Turing through Blackwell, each warp has a fixed allocation of uniform registers; they are not dynamically allocated, and they have no impact on the max-registers-per-thread calculation.

Turing has 64 uniform registers per warp.

I guess uniform registers are less useful for direct arithmetic throughput and more useful for freeing up per-thread registers. The 63 uniform registers (64 counting the constant zero URZ) are in addition to the 255 regular registers (256 counting the constant zero RZ).

Meaning that if you have enough per-thread registers, the uniform path will barely speed up the application; the workload would have to be predominantly integer instructions to see a difference.

As you can see on the Hot Chips slide, the warp scheduler still has to issue the uniform instructions (max. 1 scheduled instruction per clock).

To keep those out of your micro-benchmarks, assign operands dynamically per thread; see the sketch below.
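
For instance (a hypothetical sketch, not a tuned benchmark), seeding the measured dependency chain from the lane index denies the compiler the uniformity proof, so the chain has to stay in regular registers on the main datapath:

```
__global__ void iadd_bench(int* out, int iters)
{
    // int x = 0;         // warp-uniform seed: the dependent chain below
    //                    // could become eligible for the uniform datapath
    int x = threadIdx.x;  // per-thread seed: forces regular registers
    for (int i = 0; i < iters; ++i)  // the iterator may still be uniform,
        x = x * 3 + 7;               // but the measured ops are not
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep x live
}
```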

Uniform instructions are typically used for loop iterators and index computations. Those you typically don’t want to measure in micro-benchmarks anyway.

I agree. In a 1 op/clk scheduling regime, my comments on scheduling slots probably don’t make much sense.