PTX addition with carry-in/carry-out instructions and for loops

Hi all,
I coded some simple operations on large integers in CUDA 5, such as addition and subtraction, using
inline PTX and the addc.u32 and addc.cc.u32 instructions to handle carries.
I noticed that the carry flag is not always preserved across PTX add instructions when they are inside a for loop. I suspect that the compiler increments/decrements the loop trip counter with instructions that clobber the carry flag.
Does anyone have more insight into this, and into how to avoid it without spending extra instructions to save the carries?

To preserve the contents of any register between different asm() instances the register needs to be bound to a C-level variable. I am guessing that your code consists of a C-level loop, the body of which contains an asm() statement. To transport register data in the asm() between loop iterations you will need to bind all relevant data to C-level variables, including the carry flag where needed.
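To illustrate, here is a hedged sketch of what such a binding could look like (the function and variable names are my own invention): the first asm() materializes CC.CF into the C-level variable carry via addc, and the second re-injects it by adding 0xffffffff, which sets CC.CF exactly when the saved carry is nonzero.

```cuda
// hypothetical helper: a 64-bit addition split across two asm() statements,
// with the carry transported between them through the C variable 'carry'
__device__ void add64_split(unsigned int *s1, unsigned int *s0,
                            unsigned int a1, unsigned int a0,
                            unsigned int b1, unsigned int b0)
{
    unsigned int carry;
    asm("add.cc.u32 %0, %2, %3;\n\t"    // *s0 = a0 + b0, sets CC.CF
        "addc.u32   %1, 0, 0;\n\t"      // carry = CC.CF (0 or 1)
        : "=r"(*s0), "=r"(carry)
        : "r"(a0), "r"(b0));

    // ... intervening code (e.g. loop control) may clobber CC.CF here ...

    asm("{\n\t"
        ".reg .u32 t;\n\t"
        "add.cc.u32 t, %1, 0xffffffff;\n\t" // sets CC.CF iff carry != 0
        "addc.u32   %0, %2, %3;\n\t"        // *s1 = a1 + b1 + saved carry
        "}"
        : "=r"(*s1)
        : "r"(carry), "r"(a1), "r"(b1));
}
```

The cost of this approach is the two extra instructions per save/restore, which is exactly the overhead the original question hoped to avoid.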

One thing you could do is to code the loop inside the asm() statement. On modern GPUs (sm_20 and later), the loop would be controlled by a predicated branch, and since predicate registers are separate from the carry flag mechanism, there would be no interaction. I am not sure whether this can be made to work with sm_1x, as that uses flag registers, and so the conditional branching for the loop could conflict with the carry flag use in the arithmetic, i.e. the same problem as exists on an x86 CPU.
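A minimal sketch of that approach, assuming sm_20 or later and a limb count n > 0 (all names here are made up for illustration): the entire limb loop lives inside a single asm() statement, loop control goes through a predicate register and a predicated branch, and only the .cc instructions touch CC.CF, so the carry chain survives across iterations.

```cuda
// hypothetical n-limb addition, r = a + b, with the loop written in PTX
__device__ void addn(unsigned int *r, const unsigned int *a,
                     const unsigned int *b, unsigned int n)
{
    asm volatile(
        "{\n\t"
        ".reg .pred  p;\n\t"
        ".reg .u32   t, u, cnt;\n\t"
        ".reg .u64   pa, pb, pr;\n\t"
        "mov.u64     pa, %0;\n\t"
        "mov.u64     pb, %1;\n\t"
        "mov.u64     pr, %2;\n\t"
        "mov.u32     cnt, %3;\n\t"
        "mov.u32     t, 0;\n\t"
        "add.cc.u32  t, t, 0;\n\t"      // clear CC.CF before the chain
        "$addn_loop:\n\t"
        "ld.u32      t, [pa];\n\t"
        "ld.u32      u, [pb];\n\t"
        "addc.cc.u32 t, t, u;\n\t"      // limb add with carry-in/carry-out
        "st.u32      [pr], t;\n\t"
        "add.u64     pa, pa, 4;\n\t"    // plain adds do not write CC.CF
        "add.u64     pb, pb, 4;\n\t"
        "add.u64     pr, pr, 4;\n\t"
        "sub.u32     cnt, cnt, 1;\n\t"
        "setp.ne.u32 p, cnt, 0;\n\t"    // predicate, separate from CC.CF
        "@p bra      $addn_loop;\n\t"
        "}"
        :
        : "l"(a), "l"(b), "l"(r), "r"(n)
        : "memory");
}
```

Note that if this function is inlined more than once into the same caller, the fixed label name would be duplicated, so in real code the label scheme would need attention.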

The approach I have used, and would recommend, is to use simple straight-line sequences of ADD and ADDC to construct long-integer additions.
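As an illustration, such a straight-line sequence for a 128-bit addition might look like the following (the function and parameter names are my own, not from the thread):

```cuda
// hypothetical 128-bit addition, r = a + b, four 32-bit limbs each,
// as one unbroken carry chain inside a single asm() statement
__device__ void add128(unsigned int r[4], const unsigned int a[4],
                       const unsigned int b[4])
{
    asm("add.cc.u32  %0, %4, %8;\n\t"   // lowest limb: set carry out
        "addc.cc.u32 %1, %5, %9;\n\t"   // middle limbs: carry in and out
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;\n\t"  // top limb: carry in only
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]));
}
```

Because the whole chain sits in one asm() statement, the compiler cannot schedule carry-clobbering instructions between the adds.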

Hi njuffa,
thanks a lot for your insightful reply.

The approach I have used, and would recommend, is to use simple straight-line sequences
of ADD and ADDC to construct long-integer additions.
That is what I do currently.
As my compilation times are on the order of 20-30 minutes, I wanted to try putting these ADD/ADDC sequences inside loops to reduce the code size, hoping for a reduction in compilation time at a negligible cost in performance.
I have a bunch of inline functions implementing various multi-precision integer operations.
My operands come in different sizes (from 64 bits to around 300 bits), so I have several versions of the same operation, one per operand size.
It is a huge amount of code, and I believe that is why compilation takes so long.

As far as lengthy compilation times on the order of 20 to 30 minutes are concerned: it would be helpful if you could file a bug so the compiler team can have a look at this issue. Thanks!