Integer addition at 64 bits and wider is cleanly supported in PTX through add instructions with explicit carry-in and carry-out modifiers (add.cc and addc). This is an obscure corner of CUDA, but one with some interesting behavior.
The PTX carry-add instructions expose a surprising bit of implicit state in the form of the carry flag. Section 8.7.2 of the PTX manual merely says that they “…reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding carry-in/carry-out or borrow-in/borrow-out.” [Side question: if there’s a “.CF” flag bit in this condition code register, are there more implicit flags in this hidden register, such as a record of floating-point exceptions or underflow?]
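As an illustration of the instructions being discussed (not the actual code in question), here is a minimal sketch of a 128-bit add built from four 32-bit limbs with inline PTX. The `u128` struct and the names `add128`, `add128_kernel`, and `add128_on_device` are made up for this example, and error checking on the CUDA calls is left out for brevity.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 128-bit value stored as four 32-bit limbs, least significant first.
struct u128 { uint32_t w[4]; };

// r = a + b (mod 2^128). The carry moves from one instruction to the next
// through the implicit CC.CF flag, never through a named register.
__device__ __forceinline__ u128 add128(u128 a, u128 b) {
    u128 r;
    asm("add.cc.u32  %0, %4, %8;\n\t"   // limb 0: writes carry-out
        "addc.cc.u32 %1, %5, %9;\n\t"   // limb 1: reads carry-in, writes carry-out
        "addc.cc.u32 %2, %6, %10;\n\t"  // limb 2: reads carry-in, writes carry-out
        "addc.u32    %3, %7, %11;"      // limb 3: reads carry-in, final carry is dropped
        : "=r"(r.w[0]), "=r"(r.w[1]), "=r"(r.w[2]), "=r"(r.w[3])
        : "r"(a.w[0]), "r"(a.w[1]), "r"(a.w[2]), "r"(a.w[3]),
          "r"(b.w[0]), "r"(b.w[1]), "r"(b.w[2]), "r"(b.w[3]));
    return r;
}

__global__ void add128_kernel(const u128 *a, const u128 *b, u128 *out) {
    *out = add128(*a, *b);
}

// Host helper (hypothetical): performs one 128-bit add on the device.
u128 add128_on_device(u128 a, u128 b) {
    u128 *d, r;
    cudaMalloc(&d, 3 * sizeof(u128));
    cudaMemcpy(d, &a, sizeof(u128), cudaMemcpyHostToDevice);
    cudaMemcpy(d + 1, &b, sizeof(u128), cudaMemcpyHostToDevice);
    add128_kernel<<<1, 1>>>(d, d + 1, d + 2);
    cudaMemcpy(&r, d + 2, sizeof(u128), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return r;
}

int main() {
    u128 a = {{0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0u}};  // 2^96 - 1
    u128 b = {{1u, 0u, 0u, 0u}};                             // 1
    u128 r = add128_on_device(a, b);
    // The carry has to ripple through three limbs to reach the top one (sum = 2^96).
    assert(r.w[0] == 0u && r.w[1] == 0u && r.w[2] == 0u && r.w[3] == 1u);
    printf("%08x %08x %08x %08x\n", r.w[3], r.w[2], r.w[1], r.w[0]);
    return 0;
}
```

Keeping the whole chain inside one `asm` statement is a deliberate choice in this sketch: the compiler cannot place an unrelated instruction between the four adds, so nothing else can overwrite the flag mid-chain.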
I’ve been looking into a subtle code behavior where extended-precision integer adds get slower more rapidly than I would expect as they get wider. That is, in my code a 128-bit addition takes slightly more than twice as long as a 64-bit addition. I have a theory that the implicit carry register is the limitation.
There is no documentation about the carry implementation. Looking at the PTX and SASS shows that the carry flag is not held by an explicit user register. Unlike CPUs, we have multiple warps executing interleaved instructions so we can’t have the ALU itself hold a bit of transient state. So where is it held? I theorize that the warp scheduler holds this bit of state per lane, similar to the way it holds the warp’s program counter (though that’s only one value per-warp).
But what are the implications?
My answer: it strongly interferes with ILP instruction scheduling. If the warp has only one set of carry condition flags, then the scheduler can’t have two addition chains from the same warp in flight at the same time, even if they’re totally independent, because each chain needs the flag to itself from its first add to its last.
This may explain the subtle slowdowns I’m seeing in my code: the longer the addition chain, the longer the carry flag is “monopolized” and the fewer ILP opportunities exist. If you have chosen a small number of threads per block because you rely on ILP everywhere, then carry chains will cut into your instruction throughput. Unfortunately this describes my use case exactly: I am using tons of registers per thread (Huzzah, 255 registers in sm_35!) and few threads with heavily interleaved ILP code. (All hail Volkov!)
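One way to probe this theory would be a microbenchmark along the following lines: run 1, 2, and 4 independent 128-bit accumulators per thread in a single warp and compare cycles per iteration. This is only a sketch under assumptions. The names `acc128`, `chain_kernel`, and `run_chains` are invented for the example, the `clock64()` readings are approximate, error checking is omitted, and ptxas is free to reorder the adds, so the SASS should be inspected before drawing any conclusion.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// acc += inc on four 32-bit limbs: one carry chain through CC.CF.
__device__ __forceinline__ void acc128(uint32_t (&acc)[4], const uint32_t (&inc)[4]) {
    asm volatile(
        "add.cc.u32  %0, %0, %4;\n\t"
        "addc.cc.u32 %1, %1, %5;\n\t"
        "addc.cc.u32 %2, %2, %6;\n\t"
        "addc.u32    %3, %3, %7;"
        : "+r"(acc[0]), "+r"(acc[1]), "+r"(acc[2]), "+r"(acc[3])
        : "r"(inc[0]), "r"(inc[1]), "r"(inc[2]), "r"(inc[3]));
}

// CHAINS independent accumulators per thread; chain c starts at the value c.
template <int CHAINS>
__global__ void chain_kernel(uint32_t inc0, int iters, uint32_t *out, long long *cycles) {
    uint32_t acc[CHAINS][4] = {};
    uint32_t inc[4] = {inc0, 0u, 0u, 0u};
#pragma unroll
    for (int c = 0; c < CHAINS; ++c) acc[c][0] = c;
    long long t0 = clock64();
    for (int i = 0; i < iters; ++i) {
#pragma unroll
        for (int c = 0; c < CHAINS; ++c) acc128(acc[c], inc);  // no data dependence between chains
    }
    long long t1 = clock64();
    if (threadIdx.x == 0) {
#pragma unroll
        for (int c = 0; c < CHAINS; ++c)
#pragma unroll
            for (int k = 0; k < 4; ++k) out[c * 4 + k] = acc[c][k];
        *cycles = t1 - t0;
    }
}

// Host helper (hypothetical): launches a single warp and returns elapsed cycles.
template <int CHAINS>
long long run_chains(uint32_t inc0, int iters, uint32_t *host_out) {
    uint32_t *d_out;
    long long *d_cyc, cyc = 0;
    cudaMalloc(&d_out, CHAINS * 4 * sizeof(uint32_t));
    cudaMalloc(&d_cyc, sizeof(long long));
    chain_kernel<CHAINS><<<1, 32>>>(inc0, iters, d_out, d_cyc);
    cudaMemcpy(host_out, d_out, CHAINS * 4 * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(&cyc, d_cyc, sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_cyc);
    return cyc;
}

int main() {
    const int iters = 1 << 16;
    uint32_t out[16];
    long long c1 = run_chains<1>(0xFFFFFFFFu, iters, out);
    long long c2 = run_chains<2>(0xFFFFFFFFu, iters, out);
    long long c4 = run_chains<4>(0xFFFFFFFFu, iters, out);
    // Sanity check on chain 0 of the last run: it started at 0, so it holds iters * inc0.
    uint64_t expect = (uint64_t)iters * 0xFFFFFFFFull;
    assert(out[0] == (uint32_t)expect && out[1] == (uint32_t)(expect >> 32));
    printf("cycles/iter: 1 chain %.2f, 2 chains %.2f, 4 chains %.2f\n",
           (double)c1 / iters, (double)c2 / iters, (double)c4 / iters);
    return 0;
}
```

If independent chains can overlap freely, cycles per iteration should grow much more slowly than the number of chains; growth close to linear would be consistent with the carry flag serializing them. Either outcome is a hint rather than proof, since other limits (issue rate, dependent-add latency) can produce similar curves.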
So, my question: is my interpretation of the carry flags being stored as per-warp state correct? I suspect one awesome person here on the forum will know. He likely holds a patent (or 29) on it.
One of my long term side projects involves small-state PRNGs.