PTX addition with carry-in/carry-out instructions and for loops

Hi all,
I coded some simple operations on large integers in CUDA 5, such as addition and subtraction, using
inline PTX and the addc.u32 and addc.cc.u32 instructions to handle carries.
I noticed that the carry flag is not always preserved across PTX add instructions when they are inside a for loop. I suspect that the compiler increments/decrements the loop trip counter with instructions that clobber the carry flag.
Does anyone have more insight into this, and into how to avoid it without spending extra instructions to save the carries?

To preserve the contents of any register between different asm() instances the register needs to be bound to a C-level variable. I am guessing that your code consists of a C-level loop, the body of which contains an asm() statement. To transport register data in the asm() between loop iterations you will need to bind all relevant data to C-level variables, including the carry flag where needed.
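To illustrate, here is a hedged sketch of what such a binding could look like (the function and variable names are my own invention): the first asm() materializes CC.CF into the C-level variable carry via addc, and the second re-injects it by adding 0xffffffff, which sets CC.CF exactly when the saved carry is nonzero.

```cuda
// hypothetical helper: a 64-bit addition split across two asm() statements,
// with the carry transported between them through the C variable 'carry'
__device__ void add64_split(unsigned int *s1, unsigned int *s0,
                            unsigned int a1, unsigned int a0,
                            unsigned int b1, unsigned int b0)
{
    unsigned int carry;
    asm("add.cc.u32 %0, %2, %3;\n\t"    // *s0 = a0 + b0, sets CC.CF
        "addc.u32   %1, 0, 0;\n\t"      // carry = CC.CF (0 or 1)
        : "=r"(*s0), "=r"(carry)
        : "r"(a0), "r"(b0));

    // ... intervening code (e.g. loop control) may clobber CC.CF here ...

    asm("{\n\t"
        ".reg .u32 t;\n\t"
        "add.cc.u32 t, %1, 0xffffffff;\n\t" // sets CC.CF iff carry != 0
        "addc.u32   %0, %2, %3;\n\t"        // *s1 = a1 + b1 + saved carry
        "}"
        : "=r"(*s1)
        : "r"(carry), "r"(a1), "r"(b1));
}
```

The cost of this approach is the two extra instructions per save/restore, which is exactly the overhead the original question hoped to avoid.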

One thing you could do is to code the loop inside the asm() statement. On modern GPUs (sm_20 and later), the loop would be controlled by a predicated branch, and since predicate registers are separate from the carry flag mechanism, there would be no interaction. I am not sure whether this can be made to work with sm_1x, as that uses flag registers, and so the conditional branching for the loop could conflict with the carry flag use in the arithmetic, i.e. the same problem as exists on an x86 CPU.
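A minimal sketch of that approach, assuming sm_20 or later and a limb count n > 0 (all names here are made up for illustration): the entire limb loop lives inside a single asm() statement, loop control goes through a predicate register and a predicated branch, and only the .cc instructions touch CC.CF, so the carry chain survives across iterations.

```cuda
// hypothetical n-limb addition, r = a + b, with the loop written in PTX
__device__ void addn(unsigned int *r, const unsigned int *a,
                     const unsigned int *b, unsigned int n)
{
    asm volatile(
        "{\n\t"
        ".reg .pred  p;\n\t"
        ".reg .u32   t, u, cnt;\n\t"
        ".reg .u64   pa, pb, pr;\n\t"
        "mov.u64     pa, %0;\n\t"
        "mov.u64     pb, %1;\n\t"
        "mov.u64     pr, %2;\n\t"
        "mov.u32     cnt, %3;\n\t"
        "mov.u32     t, 0;\n\t"
        "add.cc.u32  t, t, 0;\n\t"      // clear CC.CF before the chain
        "$addn_loop:\n\t"
        "ld.u32      t, [pa];\n\t"
        "ld.u32      u, [pb];\n\t"
        "addc.cc.u32 t, t, u;\n\t"      // limb add with carry-in/carry-out
        "st.u32      [pr], t;\n\t"
        "add.u64     pa, pa, 4;\n\t"    // plain adds do not write CC.CF
        "add.u64     pb, pb, 4;\n\t"
        "add.u64     pr, pr, 4;\n\t"
        "sub.u32     cnt, cnt, 1;\n\t"
        "setp.ne.u32 p, cnt, 0;\n\t"    // predicate, separate from CC.CF
        "@p bra      $addn_loop;\n\t"
        "}"
        :
        : "l"(a), "l"(b), "l"(r), "r"(n)
        : "memory");
}
```

Note that if this function is inlined more than once into the same caller, the fixed label name would be duplicated, so in real code the label scheme would need attention.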

The approach I have used, and would recommend, is to use simple straight-line sequences of ADD and ADDC to construct long-integer additions.
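As an illustration, such a straight-line sequence for a 128-bit addition might look like the following (the function and parameter names are my own, not from the thread):

```cuda
// hypothetical 128-bit addition, r = a + b, four 32-bit limbs each,
// as one unbroken carry chain inside a single asm() statement
__device__ void add128(unsigned int r[4], const unsigned int a[4],
                       const unsigned int b[4])
{
    asm("add.cc.u32  %0, %4, %8;\n\t"   // lowest limb: set carry out
        "addc.cc.u32 %1, %5, %9;\n\t"   // middle limbs: carry in and out
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;\n\t"  // top limb: carry in only
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]));
}
```

Because the whole chain sits in one asm() statement, the compiler cannot schedule carry-clobbering instructions between the adds.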

Hi njuffa,
thanks a lot for your insightful reply.

The approach I have used, and would recommend, is to use simple straight-line sequences
of ADD and ADDC to construct long-integer additions.
That is what I do currently.
As my compilation times are on the order of 20-30 minutes, I wanted to try putting these ADD/ADDC sequences inside loops to reduce the code size, hoping for a reduction in compilation time at a negligible cost in performance.
I have a bunch of inline functions implementing various multi-precision integer operations.
My operands come in different sizes (from 64 bits to around 300 bits), so I have several versions of the same operation, one per operand size.
It is a huge amount of code, and I believe that is why compilation takes so long.

As far as lengthy compilation times on the order of 20 to 30 minutes are concerned: it would be helpful if you could file a bug so the compiler team can have a look at this issue. Thanks!