Used Registers vs Live Registers

teamvraz · June 23, 2020, 10:48pm

Can somebody explain the difference between “Used Registers” (as reported in ptxas compiler output) and “Live Registers” (as displayed in Nsight Compute “Source and SASS”)?

My project shows 88 “used registers” at compile time but 79 “live registers” in the SASS disassembly. My target is <= 80 registers to get 24 threads per warp. While the live register count suggests I am there, the used register count is what gets used. This is with compute 7.5 and Nsight 2020.1 if that matters.

I tried experimenting with maxrregcount briefly, but it did not help. Compile project w/o maxrregcount and note used registers. Add maxrregcount=used-registers. It then adds stack and local variables even though the code already meets the register requirements. Set the register limit lower and it really gets ugly. Had seen references that it would start by backing off early register loading optimizations, but that was not my experience.

Thanks.

njuffa · June 24, 2020, 12:16am

Have you tried backing off the ptxas optimization level, e.g. -Xptxas -O2, -Xptxas -O1?

If this is single-precision code, --use-fast-math could also lower register usage, but beware of potential numerical issues.

teamvraz · June 24, 2020, 2:20am

Appreciate the prompt reply. Already using fast-math though entirely integer code so uncertain the impact. Have not tried reducing optimization levels, but will do so in my quest for 80 registers.

Still want to understanding why SASS shows 79 live registers versus the 88 used registers. If the code only needs max 79 live, where does the additional register overhead come from? Can think of multiple potential explanations, but just shots in the dark so far. Have found multiple examples of used > live so don’t believe this is specific to my code.

(Get the impression the compiler rounds-up register usage to the next logical break point-- 64, 72, 80, 88, etc. Have compiled kernels with 63 live registers that generated 64 used registers and some with 64 live registers that generated 72 used registers.)

njuffa · June 24, 2020, 3:41am

I could only speculate on your Y-problem (difference in register counts), so I was trying to address your X-problem (how to reduce register usage to allow more threads to run). Yeah, --use_fast_math isn’t going to benefit integer-only code. Does the code use 64-bit integers? The CUDA compiler isn’t always exceedingly efficient with those, as they need to be emulated.

Fact: The GPU hardware does not typically allocate registers with granularity 1. The actual granularity depends on the architecture. Speculation: The compiler backend, ptxas, which performs all architecture-dependent optimizations including register allocation, presumably knows about the relevant granularity and simply rounds up to the next hardware-defined boundary.

teamvraz · June 24, 2020, 5:09am

The actual granularity depends on the architecture.

Great point. Am using Turing and a very detailed third-party analysis claims it allocates per-thread registers in blocks of 8. However, pretty sure the user-registers reported by ptxas can be arbitrary numbers. Just compiled a few variant kernels to be certain, but one generated 126. Pretty certain I have seen odd-number user-registers reported (and have absolutely seen odd-number live registers).

Given ptxas generates code with arbitrary virtual registers, wonder if its used-registers is really a conservative estimated minimum SASS required register count (taking into account what is knows of the architecture, etc.). Suppose live registers might be what SASS actually requires, but for whatever reason (technical or otherwise), ptxas used registers is the authority for thread allocation purposes.

Had hoped to save a register or two via loop unrolling, but that does not seem to work predictably. Is almost counter-productive as the additional inline code provides the optimizer more opportunity to rearrange instructions to prevents stalls. Sometimes this has positive impact. Other times it burns register(s) for no measurable improvement (at the cost of warp occupancy). Wish you could hint the compiler which was more important for a particular block of code.

I tried -O2/-O1 but neither changes the used register count in my case. Thanks for the suggestions-- really helpful to have other ideas to explore.

njuffa · June 24, 2020, 5:25am

The front half of the CUDA compiler (EDG + LLVM) generates PTX. That is both a virtual instruction set and a compiler intermediate format in single static assignment (SSA) form. Every time a result is written, a new virtual register is used. ptxas on the other hand compiles PTX into machine code (SASS). Only at that stage is there register allocation, and those registers are actual hardware registers.

ptxas clearly must perform some sort of live-range analysis in order to do determine how many registers will be needed in addition to any fixed register usage prescribed by the ABI. But there is interaction with instruction scheduling. As you noted, the number of temp registers can go up when loads are scheduled early as a latency tolerance measure. So one could speculate that ptxas makes an initial estimate of how many registers it will want to target (say 64), and when all is said an done, the actual number required might be a little less, say 62 or 63.

In the distant past I discussed hints with the CUDA compiler guys and the feedback I got is that hints (e.g. branch likelihood) are not generally helpful in their experience and not worth implementing. Compilers are tools with many phases, each controlled by different heuristics, and while mature compilers (like CUDA at this point) generate good code for the majority of source codes, they do fall short for some specific codes.

Does your code use a lot of logical expressions? There are some trade-offs in the use of the LOP3 machine instruction that may change register usage slightly. The reason is that a LOP3 needs three source registers, so in general use of LOP3 reduces instruction count but slightly increases register pressure.

Consider posting your code, it is hard to speculate what could be responsible for the high register usage without source. I assume you have already worked through higher-level algorithmic changes and transformations that may have an impact on register pressure.

teamvraz · June 25, 2020, 5:04am

Solid analysis and we share similar thoughts. This is a learning experiment for me using ethash (lots of examples, well understood problem space and easy to validate correctness) in preparation for porting proprietary CPU/SIMD software defined radio code

The compile-time ptxas live-register analysis clearly differs from SASS (maybe due to the multiple version possibilities). For my turing device, the nsight compute occupancy graph is eight register multiples (corresponding to the turing architectural analysis I read). Seems like SASS might need 80-81 live registers while burning 89-90 psx estimated registers then rounded up to 96 physical registers-- yikes. Assume the issue scales with kernel register requirements.

To your query about specific code, here is an interesting example (with build instructions) that demonstrates the underlying issue. This github ethereum project (GitHub - ethereum-mining/ethminer: Ethereum miner with OpenCL, CUDA and stratum support) has solid performance and beats my custom code (to my dismay). Of the GPU examples I have reviewed, first is a thread-specific serial keccak/SHA3 pass followed (after distributing state) by the parallel thread “memory hard” portion followed by (after consolidating state) a thread-specific keccak/SHA3 final test.

The main CUDA kernel (ethminer/dagger_shuffled.cuh at master · ethereum-mining/ethminer · GitHub) is relatively straightforward, though it contains some magic number choices that I assume were trail/error to accommodate ptxas. Here is the mostly raw code with #defines replaced by literals. The bfe() asm makes zero difference and the offset[p] array can be replaced by a local variable. The main interesting bits are the three loops and their step values.

for (uint32_t a = 0; a < 64; a += 4)
{
int t = bfe(a, 2u, 3u); // (a>>2)&7
for (uint32_t b = 0; b < 4; b++)
{
for (int p = 0; p < 4; p++)
{
offset[p] = fnv(init0[p] ^ (a + b), ((uint32_t*)&mix[p])[b]) % d_dag_size;
offset[p] = SHFL(offset[p], t, 8);
mix[p] = fnv4(mix[p], d_dag[offset[p]].uint4s[thread_id]);
}
}
}

A cleaner version with more explicit loop details:

#pragma nounroll
for (uint32_t a = 0; a < 64; a += 4)
{
#pragma unroll
for (uint32_t b = 0; b < 4; b++)
{
#pragma unroll
for (int p = 0; p < 4; p++)
{
offset[p] = fnv(init0[p] ^ (a + b), ((uint32_t*)&mix[p])[b]) % d_dag_size;
offset[p] = SHFL(offset[p], (a>>2)&7, 8);
mix[p] = fnv4(mix[p], d_dag[offset[p]].uint4s[thread_id]);
}
}
}

By default, small loops are unrolled and large loops are not. Under my turing/7.5, the non-pragma/pragma versions give identical performance. ptxas requires 80 used-registers which means 24/32 warp threads-- important since this code requires thread groups of 8. Full wrap occupancy would be interesting, but that is impossible-- no way go from 80 to 64. The opposite is easily possible-- unroll the main loop. In a perfect world, that wold eliminate the loop variable (along with some calculations that are now constants) giving a couple more registers and eliminating the loop structure (understanding the memory intensive nature would limit compute optimizations). Might add a few instruction cache misses (official documentation is sparse in this area), but doubt it is material.

Change the main loop #pragma from “nounroll” to “unroll” and ptxas used registers jumps from 80 to 90 and overall performance drops by 2%. The cause of the register bloat is the transition from the memory lookup phase to the state consolidation phase. The change in instruction mix allows using early register loads to reduce some stalls (though with negative payoff).

Number one requirement of localized optimizations is not making things worse (or providing the ability to disable locally if that is a possibility). Because ethash is memory hard by design, the remaining performance comes down to table scraps from other small optimizations (or lack thereof). Curious how this plays out for SDR, but expect similar. Algorithmic optimization against the main bottleneck and then fighting for small gains from whatever is left.

Played with one idea that showed promise to limit the scope of early register loads, but no joy. Created custom versions of shft_sync and xor2 using asm volatile, hoping the latter would prevent ptxas/sass from earlier scheduling of certain instructions (effectively creating a barrier) and in turn reducing register pressure. Turned into a game of whack-a-mole. It influenced register loads, but not with predictability. I want to work with the compiler, not fight against it. (Apologies for the code formatting-- the forum software seems to be missing a literal/code format.)

njuffa · June 25, 2020, 2:44pm

Apologies for the code formatting-- the forum software seems to be missing a literal/code format

For the markup of code, you can either use indentation by four spaces, like on Stackoverflow, or you can wrap the code in [code][/code]. Doing this right now seems to have the interesting side effect that the former simply preserves indentation but the latter also adds syntax highlighting. You can also highlight the text portion you want treated as code and click on the </> button at the top of the input window.

int foo (float bar)
{
    return bar * bar;
}

float bar (int baz)
{
     return baz * 3.14f;
}

As you discovered, you cannot influence instruction scheduling by the compiler in a meaningful or predictable fashion. In general the CUDA compiler makes good decisions when it comes to that, although in a given case the choices may not be perfect. That’s no different from other compilers, though. The impact of the loop unrolling is easy to understand I think. After flattening the code the compiler sees additional opportunities for optimization. That the trade-off it makes isn’t perfect (2% normally falls under “noise level” in my book) is par for the course. Have you tried partial unrolling? That’s just “waving the rubber chicken” at the issue, but worth a quick trial.

I am wondering about the % d_dag_size in the innermost loop. Modulo isn’t exactly a cheap operation as it is emulated (the GPU has no native support for any kind of division), taking up instructions and registers. Is that a fully generic modulo operation, or is there a limited number of fixed divisors in use? I am also wondering about the use of switch. This may be outdated knowledge, but in my recollection switch didn’t result in the best code.

teamvraz · June 26, 2020, 5:39am

Appreciate the code markup feedback-- will use for future posts. In some cases I might also consider 2% noise, but not for a deliberately difficult problem that is run at scale. Some optimizations are always break-even or better so controlling them provides debatable value. Because registers are such a critical resource, not providing some control over the allocation process strikes me as a serious shortcoming (my personal opinion). Seems like maxrregcount was intended to be that control, but appears very broken.

Good observation on d_dag_size. This value changes every few days. I ran a prior test replacing it with its literal value (at the time) and believe it was <0.25% improvement (but it might have saved a register). It is not a power-of-2 or any easy to factor number so I left it alone after that experiment.

The switch statement also gave me pause. The contributor noted that it improved performance on some architectures (though that was several years back). It is actually a problem for Nsight Compute as the SASS disassembly fails due to a symbol table lookup issue. The switch can be replaced by discrete “if (threadIdx.x & mask)” statement to resolving the SASS disassembly issue. That code takes so few cycles (compared to the sha3/keccak and dag memory traversal) that I cannot imagine how any particular approach would have measurable impact.

I tried converting a chunk of my version directly to ptx assembly, but so far unable to meaningfully impact the ptx used-register estimator or the SASS live-register count (easy to increase register usage, but can only reduce by one). Will poke at it more tomorrow, but starting to run low on ideas. Certainly meeting my expectations as a great learning experience.

njuffa · June 26, 2020, 5:59am

The CUDA toolchain has gone through multiple major changes (proprietary frontend to Open64 to LLVM), and -maxrregcount was an early mechanism to influence the number of registers used for device code, on a per compilation unit basis. The reasons for this were twofold: Early GPU architectures were incredibly register poor compared to their modern counterparts, so the issue of register pressure and its interaction with occupancy loomed frequently. On top of that all parts of the compiler were quite immature, and there was an “impedance mismatch” at the PTX interface, ultimately leading to suboptimal code with suboptimal register usage.

Later, a more flexible mechanism was invented that allows control on a per function basis. This is the __launch_bounds() attribute. With modern register-rich architectures and a much more mature tool chain I have not had a need to use that in years.

When you start squeezing the compiler down by a few (about 2 to 3) registers, the usual reaction is that it starts reducing optimizations. If you squeeze by more than a few registers, it will often start to spill. Both of these responses cause performance degradation, which is counterbalanced (one hopes) by increased occupancy. But depending on what the code does and where its bottlenecks are , increasing occupancy may result in only a tiny improvement, or none at all. Today’s CUDA compiler makes excellent choices regarding these trade-offs, and it is hard to beat, as at least I convinced myself years ago, at which point I stopped playing with -maxrregcount and __launch_bounds().

Good to hear you already tried an experiment with division by literal constant. In isolation, I would expect it to reduce register use by two registers, but the modulo operation might not occur at the peak count of live registers here. Division by constant is very cheap, basically an integer multiply and a shift, even if the divisor has no special structure at all. This has been a standard optimization in compilers since the mid 1990s.

Have you looked at your code with the CUDA profiler in detail? The analysis provided by it has been solid for many years, and it is much better a pinpointing bottlenecks compared to simple static code analysis.

teamvraz · June 27, 2020, 4:49am

“impedance mismatch”-- that really resonates ;-). Adapting an open-source CPU oriented compiler framework to GPU is no simple task and not trying to take away from the complexity involved. Only done a couple trivial experiments with launch_bounds because register count was not an explicit parameter. When working against the uncertain and already encountering inconsistent behavior, want to decrease variables. Agree that playing with maxrregcount and launch_bounds is currently fruitless, though our reasons differ.

“Division by constant is very cheap, basically an integer multiply and a shift”. Even as a variable constant, SASS deconstructs into a few multiplies, a couple additions and some compares/select (apparently the decomposition of PTX rem). Have implemented long-division on 8-bit CPUs and reciprocal division on later CPUs, but could not touch this. The compiler is very good at math.

“Have you looked at your code with the CUDA profiler in detail ?” Not beyond Nsight Compute and my own PTX/SASS code analysis. For whatever reason, nvprof did not install properly and insufficient time to figure out why (likely a Windoze issue).

In the “most annoying of the day”, combined (threadIdx.x & 7) plus a base address into a single resigter. Figured double use for data addressing plus and-logic-compare on the low-bits (replacing that switch statement and similar). Was flabbergasted when the compiler realized the low-bits were actually thread-id and saved a separate (useless) register to perform the logic against. So smart and so dumb at the same time.

And if at first you don’t succeed, get a bigger hammer. Realized that predicates could be my friend. Multiple steps move data between threads based on thread-id. Instead of letting PTX/SASS flail with extra registers and dubious logic, put thread-id bits 0-2 into p0-p2 (8-thread cooperative) for the kernel duration.

Since predicates are a not a C/C++ concept, coded some custom PTX select-of intrinsics. The compiler is fighting back hard with massive performance/register variations depending on select-of implementation. @Predicate enable of mov (mux-channel sort-style) results in branch-carnage that is difficult to unsee. A simple 7-instruction selp 4/2/1 reduction gives the best register reduction (even converting a single transition). Using both selp predicate-enable and predicate-select reduces to five instructions, but the compiler is less happy.

Some curious learning from todays experiments:

According to SASS, LOP3 can produce a predicate result, but I cannot make it happen from PTX. Must use discrete logic followed by setp and then SASS reduces to a single LOP3 instruction.
The used-register output from ptxas appears to be the register limit for SASS. Though live-register usage might be (significantly) lower, look around the SASS peak and the highest register number is close to the ptxas used-register limit.
Seems like the round-up of used-registers to architectural register allocation is wasted. Need to double verify, but if ptxas says 89 registers are required (with an 8-alignment register architecture), SASS should figure out how to use 96 registers.
Make shfl work with predicates. If the pseudo-code has any relation to the implementation, eliminate the “&& isGuardPredicateTrue(Thread))”. Tried a workaround by to make source-lane invalid when result-not-wanted (not trigger the predicate address error) but shifted to my select-of approach as more short-term productive.

njuffa · June 27, 2020, 5:19am

Re: predication

This is just based on my personal observations, not sure whether others are seeing the same thing. Backends for older architectures made heavier use of predication, while the backends for newer architectures appear to prefer select-type instructions. No idea why that is. The downside in cases like yours is slightly increased GPR pressure when selects are used.

teamvraz · June 27, 2020, 6:18am

Even without my “encouragement”, the backend likes predicates for this application (though ROI unclear). No idea if this has changed over time for CUDA. It is a great concept, but predates older languages and requires compiler/assembler identification to leverage.

One challenge for this toolchain (and many others before), is failing to weigh the obfuscation factor of the “optimizations” against the benefit. With older compilers, changing the optimization level would provide some insight. PTX optimization levels appear to have very limited impact on SASS. Are there optimization levels/restrictions for SASS?

njuffa · June 27, 2020, 6:53am

Are there optimization levels/restrictions for SASS

I seem to recall mentioning the optimization levels for ptxas somewhere near the start of this thread. The default is -Xptxas -O3, and you can dial it down all the way to -Xptxas -O0.

Various nvcc switches and maybe some pragma are possibly passed all the way down to ptxas, possibly as directives inside the PTX code. I haven’t looked that stuff in many years.

As for __launchbounds, the reason no register counts are mentioned explicitly with this attribute is because this was designed to be portable across different GPU architectures. What you see in place today is the best solution we could come up with after discussing and experimenting with this internally for quite a long time.

teamvraz · June 28, 2020, 4:44am

Apologies on the optimization levels-- had previously tried but seemed to have no effect. Turned out the build script was missing the --ptxas-options= prefix so changing the levels had no impact. Correctly specified: O0=93 registers, code with tons of bogus moves (move R3, R3; ,etc) running around 75% speed. O1 strips dead code and allocates extra registers to hide stalls. O1=165 registers and 0.5% slower (surprising since only 8/32 warp threads). O2 does register optimization (90 regs) and runs full speed. O3 appears to perform peephole optimization with the same register count and performance.

While experimenting, I found a leak in the register lifetime analysis code. No idea if related to my “why don’t used-registers and live-registers” match. Understandable why the compiler struggles with it.

Part of this algorithm requires moving data between threads. One pattern involves a 8 shuffles into a temp array (a different register source for each shuffle) and then each thread picking the value of interest and saving it. Must be done properly to avoid an array indexing situation. That interesting switch-statement code was one approach. Since shuffle does not support predicates (sure would be a nice addition), I implemented a select-1of8 using precalculated predicate registers. Three predicates allows a five instruction select-8 but with side effects and register dependencies. This was my first attempt and it worked fine. Five predicates allows a four instruction select-8 without side effects and without any internal register dependencies. When changing this version to remove the side-effects, my used-register count went from 90 to 155:

DEV_INLINE uint32_t select8(uint32_t s0, uint32_t s1, uint32_t s2, uint32_t s3, uint32_t s4, uint32_t s5, uint32_t s6, uint32_t s7)
{
    uint32_t r;
    asm("@tid0 selp.u32 %0, %1, %5, !tid4;\n\t"
        "@tid1 selp.u32 %0, %2, %6, !tid4;\n\t"
        "@tid2 selp.u32 %0, %3, %7, !tid4;\n\t"
        "@tid3 selp.u32 %0, %4, %8, !tid4;\n\t"
        : "=r"(r)
        : "r"(s0), "r"(s1), "r"(s2), "r"(s3), "r"(s4), "r"(s5), "r"(s6), "r"(s7));
    return r;
}

A little bit of head scratching until I guessed that the compiler was uncertain if r was always initialized due to all assignments being predicated. Changed to “uint32_t r = 0;” but that did not resolve it (and did not even emit a PTX instruction). Added an explicit “mov.u32 %0, 0;” to finally resolve. With that change, SASS strips the predicate off the first selp instruction apparently to force initialization. (Likewise, can remove one of the predicates and make that select come first to resolve.)