PTX carry prapagation issue

Hi.
If I have the folowing two functions for the device:

typedef unsigned int u32;
__device__ u32
__uaddo(u32 a, u32 b) {
	u32 res;
	asm("add.cc.u32 %0, %1, %2; /* inline */ \n\t" 
	    : "=r" (res) : "r" (a) , "r" (b));
	return res;
}

__device__ u32
__uaddc(u32 a, u32 b) {
	u32 res;
	asm("addc.u32 %0, %1, %2; /* inline */ \n\t" 
	    : "=r" (res) : "r" (a) , "r" (b));
	return res;
}

… and I call them in the folowing order:

u32 a = 0xffffffff;
u32 b = 0xf;
u32 s = __uaddo(a,b);
s = __uaddc(0,0); // s will be 0 (zero) although a+b doesn't fit in 32-bit

What is the reason for not having the carry flag set in the second call?
Someone said that “the flags are ephemeral”.
Isn’t the CC.CF flag a hardware one? Shouldn’t it be set until the next carry-out instruction?

As I explained over on Stackoverflow ([url]CUDA - PTX carry propagation - Stack Overflow), there can be no guarantees that a carry flag setting (or any flag setting for that matter) carries over from one separate asm() statement to the next. For example, the compiler could easily schedule another flag-setting instruction between the code resulting from the two asm() statements.

Your code either has to produce and consume the carry flag within the same asm() statement, or you have to export the carry flag state into a C variable in the earlier asm() statement, then import it from that variable in the subsequent asm() statement.

Hi again. I understand now that another process can easily change that flag between the two calls. Thanks

Not sure what you mean by “other process”. Other instructions from other source code statements could be interspersed with code from the asm() statements.

I think that when my code runs, GPU is not busy only with it. There are other instructions from other processes (suppose I have a video in vlc running that uses the GPU at the same time). Am I saying right? Is the word “process” inappropriate?

As Norbert explained, you need to make sure the ASMs get executed as a block since the compiler might reorder (or elide) instructions, and you depend on having both executed, in order, and with no other instructions inbetween that might change the flag. Norbert told you the solution as well: capture the data you want in a variable that nvcc can see, and/or combine the ASM statements into one monolithic block that nvcc can’t interfere with.

I wonder if in your case the first call to __uaddo is entirely optimized away by the nvcc compiler because you never use the return value. The NVCC compiler does not know that your black-box ASM has a side effect of setting the carry flag, so it’s deleting code that is seemingly unused. That’s just a guess though. Examining the PTX output of nvcc would tell you what’s happening.

The call to __uaddo() is in fact optimized away as dead code in this particular example, at least in release builds, and I pointed this out in the answer I gave on Stackoverflow.

__global__ void testing(u32* s) // s preallocated
{
	u32 a, b;

	a = 0xffffffff;
	b = 0xffffffff;
	
	s[1] = __uaddo(a,b);
	s[0] = __uaddc(0,0);
        //s[0] = __uaddc(1,1); // tryed this as well
}

This code gives the same result, that is no CC.CF is added, so I think it is not about the optimization.
So there are other instructions that may change the flag.

I tried this:

__device__ uint32_t __uaddo (uint32_t a, uint32_t b) 
{
    uint32_t res;
    asm ("add.cc.u32 %0, %1, %2;\n\t" : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__device__ uint32_t __uaddc(uint32_t a, uint32_t b) 
{
    uint32_t res;
    asm ("addc.u32 %0, %1, %2; \n\t" : "=r" (res) : "r" (a) , "r" (b));
    return res;
}

__global__ void kernel (uint32_t a, uint32_t b, uint32_t *res)
{
    uint32_t s;
    res[0] = __uaddo (a,b);
    res[1] = __uaddc (1,1);
}

The kernel translates to this:

/*0000*/         MOV R1, c[0x1][0x100];
       /*0008*/         MOV R0, c[0x0][0x20];          // a
       /*0010*/         MOV32I R2, 0x1;                // 1
       /*0018*/         MOV R4, c[0x0][0x28];          // load res<31:0>
       /*0020*/         IADD R0.CC, R0, c[0x0][0x24];  // __uaddo (a, b)
       /*0028*/         MOV R5, c[0x0][0x2c];          // load res<63:32>
       /*0030*/         IADD.X R2, R2, 0x1;            // __uaddc (1,1)
       /*0038*/         ST.E [R4], R0;                 // store res[0]
       /*0040*/         ST.E [R4+0x4], R2;             // store res[1]
       /*0048*/         EXIT;

The IADD with R0.CC sets the carry, and the IADD.X consumes the carry. Since there are no intervening instructions that modify the condition codes, the carry propagation happens to work as desired in this case, the results returned are 0000000e 00000003. Note the “3”, indicating the carry was added to the sum 1+1.

As I pointed out, this will not work in general (we do not always get so lucky) and the only safe way to rely on the carry flag is within the same asm() statement, or by moving the carry flag to a bound variable (this requires at least one additional instruction) which can be fed to a subsequent asm() statement later.

Good and complete explanation both here and on [url]CUDA - PTX carry propagation - Stack Overflow. I appreciate.