Carry flags and ILP

PTX cleanly supports integer addition of 64 bits and wider through add instructions with explicit carry-in and carry-out flag modifiers. This is an obscure corner of CUDA, but with some interesting behavior.

The PTX carry-add instructions expose a surprising bit of implicit state in the form of the carry flag. The PTX manual in 8.7.2 merely says “…reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding carry-in/carry-out or borrow-in/borrow-out.” [Side question: if there’s a “.CF” flag bit in this condition code register, are there even more implicit flags in this hidden register, like a record of floating point exceptions or underflow, etc?]

I’ve been looking into a subtle code behavior where building up extended precision integer adds becomes slower more rapidly than I would expect, i.e., in my code a 128 bit addition takes slightly more than twice as long as a 64 bit addition. I have a theory that the implicit carry register is the limitation.

There is no documentation about the carry implementation. Looking at the PTX and SASS shows that the carry flag is not held by an explicit user register. Unlike CPUs, we have multiple warps executing interleaved instructions so we can’t have the ALU itself hold a bit of transient state. So where is it held? I theorize that the warp scheduler holds this bit of state per lane, similar to the way it holds the warp’s program counter (though that’s only one value per-warp).
But what are the implications?

Answer: it strongly interferes with ILP instruction scheduling. If the warp has only one set of condition flag registers, then it can’t schedule two different addition chains from the same warp at the same time even if they’re totally independent because there is only one set of carry condition flags.
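To make the dependency shape concrete, here is a minimal sketch in plain C++ (not the actual PTX or SASS). The `U128` type and the `add128` name are invented for this illustration, and the carry is an ordinary local variable. The comments note how the steps would line up with the PTX `add.cc` / `addc.cc` / `addc` forms, where the carry instead travels through the implicit CC.CF bit. Whether the hardware really serializes independent chains on that bit is the theory under discussion here, not something this sketch demonstrates.

```cpp
#include <cstdint>

// 128-bit unsigned integer as four 32-bit limbs, least significant first.
// (Hypothetical type, for illustration only.)
struct U128 {
    std::uint32_t limb[4];
};

// Adds a and b modulo 2^128. Each step consumes the carry produced by the
// previous step, so the four additions form one serial dependency chain.
// In PTX the same shape would be add.cc.u32, addc.cc.u32, addc.cc.u32,
// addc.u32, with the carry held in CC.CF rather than in a named variable.
inline U128 add128(const U128& a, const U128& b) {
    U128 r{};
    std::uint32_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t s = std::uint64_t(a.limb[i]) + b.limb[i] + carry;
        r.limb[i] = static_cast<std::uint32_t>(s);    // low 32 bits of the sum
        carry = static_cast<std::uint32_t>(s >> 32);  // carry out: 0 or 1
    }
    return r;
}
```

With an explicit carry variable, two unrelated calls to `add128` share no state at all, so a scheduler is free to interleave them. The concern above is that a single per-warp CC.CF would remove that freedom.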

This subtlety may explain why I’m seeing quite subtle slowdowns in my code… the longer the addition chain, the longer the carry flag is “monopolized” and the fewer ILP opportunities exist. If you have chosen a smaller number of threads per block because you use ILP everywhere, then the use of carries will interfere with your instruction efficiency. This unfortunately describes my use, where I am using tons of registers per thread (Huzzah, 255 registers in sm_35!) and few threads with heavy ILP code interleaving. (All hail Volkov!)

So, my question: is my interpretation of the carry flags being stored as per-warp state correct? I suspect one awesome person here on the forum will know. He likely holds a patent (or 29) on it.

One of my long term side projects involves small-state PRNGs.

My observations from dabbling in multi-precision integer arithmetic match yours, including the fact that use of the carry flag appears to impede ILP (on those GPUs where ILP is a meaningful concept).

I do not know anything about the hardware implementation behind the carry flag, and it could well differ between GPU architecture generations which would explain the level of abstraction provided by PTX.

As far as I know, there are no floating-point exception bits of any kind. Floating-point operations provide the results defined by IEEE-754 for the case that exceptions are masked.

and thusly performance got carried away

Except in isolated instances I would expect the carry flag handling to have zero to very limited impact on performance. The performance of multi-precision integer codes I have worked with is typically dominated by integer multiplies, and thus integer multiply throughput tends to be the limiting factor before anything else.

It should also be noted that only with the most recent generation of GPUs is there a meaningful notion of ILP. I have yet to encounter a situation where more than an opportunistic approach to improve ILP is warranted. I gave a simple real-life example of such opportunism involving floating-point operations in an earlier thread that I cannot find now, so here is another one that hopefully is not too contrived: Assume you want to compute p(x)*x**3, where ** denotes exponentiation. One could compute this as p(x)*x*x*x, but computing this as t=x*x, p(x)*t*x improves ILP (based on my understanding of the relevant standards, a C/C++ compiler cannot transform one expression into the other, but a Fortran compiler could).
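The two orderings of the cube example can be sketched as follows. The function names `cubeSerial` and `cubeSplit` are made up for this illustration, and `p` stands for an already evaluated value of p(x), which in real code would itself be the end of a longer computation.

```cpp
// Left-to-right evaluation: parsed as ((p * x) * x) * x, so all three
// multiplies depend on p and on each other.
inline double cubeSerial(double p, double x) {
    return p * x * x * x;
}

// Split evaluation: t = x * x does not depend on p, so it can be issued
// while p is still being computed. Only two multiplies remain on the
// critical path after p is ready.
inline double cubeSplit(double p, double x) {
    double t = x * x;
    return p * t * x;   // parsed as (p * t) * x
}
```

The two functions round at different points, so their results can differ in the last bit for general inputs. That is why a C/C++ compiler following the language rules will not rewrite one into the other on its own.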

Norbert, thanks for confirming my suspicions! You’re right that in general such carry flag dependencies are ignorable in terms of performance, but my own tests are trying to benchmark performance accurately so they can detect the impact. Your trick about breaking up compute order to help ILP is also very valid! In fact it’s a (tiny but measurable) speed win for high bit integer counters to only use the carry bits for 64 bits of a counter, with an explicit if() test to continue to 96 or 128 bits.
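A minimal sketch of that counter trick, assuming a 128-bit counter split into two 64-bit halves (the `Counter128` type and `increment` name are hypothetical, and this is plain C++ rather than the PTX actually benchmarked):

```cpp
#include <cstdint>

// 128-bit counter stored as two 64-bit halves. (Hypothetical type.)
struct Counter128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Increments the counter by one. Only the low 64 bits go through an
// ordinary add (on the GPU, a short two-instruction carry chain). The high
// half is touched only when the low half wraps to zero, which is rare, so
// the common path never extends the carry chain past 64 bits.
inline void increment(Counter128& c) {
    ++c.lo;
    if (c.lo == 0) {   // low half wrapped around: propagate explicitly
        ++c.hi;
    }
}
```

The branch is almost never taken, so within a warp it costs little, and the carry flag is released after the 64-bit add instead of being held across all four 32-bit limbs.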

Talking about ILP optimization in general, though, it’s easy to create practical cases where code depends on ILP’s efficiency in sm_35. The new, huge, 255 register limit now allows you to stage many many independent parallel computes per thread. The speed advantage is not so much from the ILP directly (you could get that with more warps instead), but from the ability to amortize compute or especially data reads WITHIN a thread (or warp) since all the data is accessible for free within the thread. This is definitely a new coding style (again enabled by sm_35) but both allanmac and I have embraced it. 255 registers per thread is the real win of sm_35, not the faster shifter or even Dynamic Parallelism.

The much larger register file of sm_35 can certainly give a huge boost to some codes. For example, it allows each thread to handle an entire small matrix, providing for excellent register blocking. The large register file per se does not provide a boost in terms of ILP, other than providing some headroom for additional temporary registers that code with increased ILP may require, for example when using Estrin’s method instead of Horner’s method to evaluate a polynomial.
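For a cubic, the Horner versus Estrin trade-off looks roughly like this. It is a sketch with invented function names (`horner3`, `estrin3`), and the coefficient layout c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3 is an assumption made for the example.

```cpp
// Horner's method: one chain of three dependent multiply-adds and no
// temporaries beyond the running result.
inline double horner3(const double c[4], double x) {
    return ((c[3] * x + c[2]) * x + c[1]) * x + c[0];
}

// Estrin's method: x2, lo and hi are mutually independent and can be
// computed in parallel, at the cost of three extra temporaries that each
// need a register.
inline double estrin3(const double c[4], double x) {
    double x2 = x * x;
    double lo = c[1] * x + c[0];
    double hi = c[3] * x + c[2];
    return hi * x2 + lo;
}
```

The gap widens with degree: Horner's critical path grows linearly with the number of coefficients, while Estrin's grows roughly logarithmically but keeps more intermediate values live at once.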

Instruction level parallelism is a function of the amount of independent operations within the same thread, or the “inverse” of data and control dependencies. Programmers can positively influence the amount of ILP present, especially during floating-point computations where C/C++ compilers may be restricted in their transformations by language rules. How much of the theoretically available ILP then falls to the bottom line depends on how flexible multi-instruction issue rules are (per cycle and thread), and the compiler’s ability to schedule operations to exploit multi-issue rules as best as possible.

With sm_35, putting thought into how to exploit (and increase) ILP could be appropriate for expert programmers, but overall I still see it as a secondary optimization issue at this point, compared to primary performance issues like optimizing and minimizing data movement.

I wouldn’t exactly agree with the statement “The much larger register file of sm_35 can certainly give a huge boost to some codes”.

Compute 3.0 also had 64 K registers per SMX, but you had to launch a sufficient number of warps (blocks) to use it fully. What actually scores in SM 3.5 here is the new instruction set that is able to encode 255 distinct registers into operands.

Sorry, I expressed myself poorly. I meant the larger number of registers available per thread, not the total number of registers available per SM. So I think we are all on the same page.