Carry bit not considered when compiler removes unused code

When the NVCC compiler eliminates dead code, it seems to not consider the carry bit when it’s used across multiple inline-assembly blocks. Is this intended behavior?

It’s an issue in my code because I’m doing extended-precision arithmetic, and in parts of my code I only need the most significant bits of the result. For example, when multiplying two 128-bit integers (each consisting of four 32-bit words), the full result would be 256 bits (8 words). In places I need only the top 128 bits (4 words) of the result. However, some of the calculations that produce the lower 128 bits still need to be performed so that carry bits propagate correctly into the upper half.

Here’s a simple example:

__global__ void carryBitKernel(unsigned *a,
			       unsigned *b,
			       unsigned *result) {
  unsigned tmp0, tmp1;

  asm("add.cc.u32 %0, %1, %2;"  // add with carry-out
      : "=r"(tmp0)
      : "r"(a[0]), "r"(b[0]));
  asm("addc.u32 %0, %1, %2;"    // add with carry-in
      : "=r"(tmp1)
      : "r"(a[1]), "r"(b[1]));

  result[1] = tmp1;
}

Since tmp0 is never used, the first add instruction is eliminated as dead code, and the carry bit on which the second add instruction relies is never set. (I found this behavior with nvcc 4.2 and 5.0, generating code for -arch=sm_20 or -arch=sm_30.)

Combining the instructions into one ASM block fixes the problem:

  asm("add.cc.u32 %0, %2, %4;"
      "addc.u32   %1, %3, %5;"
      : "=r"(tmp0), "=r"(tmp1)
      : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]));

  result[1] = tmp1;

However, it would make the code much more flexible if each assembly instruction could be wrapped with a C function. For example:

dest[0] = addCarryOut(a[0], b[0]);
for (int i = 1; i < 4; i++)
  dest[i] = addCarryInOut(a[i], b[i]);

As far as I understand it, this happens because the compiler can’t analyze what goes on inside an assembly block, which it treats as a black box. It would do the same thing if you forgot the output register on an assembly block. To get this to work, you would have to explicitly tell it that the first asm statement writes the carry bit and the second asm statement reads it. I’m not sure what the right syntax for this would be though. Maybe something like this:

unsigned tmp0, tmp1, carry;

asm("add.cc.u32 %0, %1, %2;" // add with carry-out
    : "=r"(tmp0), "=r"(carry)
    : "r"(a[0]), "r"(b[0]));
asm("addc.u32 %0, %1, %2;"   // add with carry-in
    : "=r"(tmp1)
    : "r"(a[1]), "r"(b[1]), "r"(carry));

result[1] = tmp1;

?

To do this automatically, the compiler would need to have an internal representation of the assembly (PTX in this case). However, this goes against a core design principle of a retargetable compiler: that it should be possible to optimize a program without knowing anything about the target ISA. So I wouldn’t expect this to ever work automatically.

Another approach that should work is to move the temporary into the inline-assembly body as a PTX-local register, so the compiler never sees an unused output. The compiled output appears to be correct.

For correctness it might also be appropriate to throw a volatile onto the front of the add.cc block in case a future version of the inline assembler+compiler becomes smart enough to realize that tmp0 isn’t being used but not smart enough to realize that the carry-bit is being used. Right now it seems to just work. :)

__global__ void carryBitKernelLocal(unsigned *a,
                                    unsigned *b,
                                    unsigned *result)
{
  unsigned tmp1;

  asm("{                          "
      "  .reg .u32  tmp0;         "
      "  add.cc.u32 tmp0, %0, %1; " // add with carry-out
      "}"
      : 
      : "r"(a[0]), "r"(b[0]));
  asm("addc.u32 %0, %1, %2;"        // add with carry-in
      : "=r"(tmp1)
      : "r"(a[1]), "r"(b[1]));

  result[1] = tmp1;
}

The SASS looks good too (reordered for clarity):

/*0020*/     /*0x00601c8580000000*/ 	LD R0, [R6];
/*0030*/     /*0x0050dc8580000000*/ 	LD R3, [R5];
/*0048*/     /*0x003fdc0348010000*/ 	IADD RZ.CC, R3, R0;
/*0038*/     /*0x10609c8580000000*/ 	LD R2, [R6+0x4];
/*0050*/     /*0x10511c8580000000*/ 	LD R4, [R5+0x4];
/*0060*/     /*0x08409c4348000000*/ 	IADD.X R2, R4, R2;

An approach that does not work is simply declaring the first inline-assembly block volatile (“asm volatile”). The volatile causes the PTX sequence to be dutifully generated, but it references a non-existent register %r1 (tmp0) that was removed by nvcc.

__global__ void carryBitKernelVolatile(unsigned *a,
                                       unsigned *b,
                                       unsigned *result)
{
  unsigned tmp0, tmp1;

  asm volatile ("add.cc.u32 %0, %1, %2;"  // add with carry-out
                : "=r"(tmp0)
                : "r"(a[0]), "r"(b[0]));
  asm("addc.u32 %0, %1, %2;"              // add with carry-in
      : "=r"(tmp1)
      : "r"(a[1]), "r"(b[1]));

  result[1] = tmp1;
}

The compiler treats code on whose results nothing depends as dead code and removes it. There is no way I know of to express a data dependency through the carry flag with the existing asm() bindings. As far as I understand, this is a generic limitation of gcc-style inline assembly.

To split multi-word arithmetic across multiple asm() statements, you would have to save the carry in a general purpose register (e.g. with “addc.u32 %1, 0, 0”) and bind it to some variable with “=r”, then bind that variable as an input in the following asm() statement. Workable, but not very efficient.

The solution, as you already noticed, is to place all instructions for a given multi-word arithmetic operation inside a single asm() block. Since the code is just a string that can be concatenated from as many individual string fragments as needed, this approach is straightforward and eminently amenable to machine-generated code.

[later:]

Thinking about it a bit more, I don’t see how any system of inline assembly could keep the carry flag, or indeed any other register, alive between separate asm() blocks. Since the contents of all registers are volatile between two separate asm() statements, any piece of data one wants to carry between them must be saved in a HLL variable. And since the carry flag (and other flags, if an architecture supports them) is accessible only indirectly, it is the programmer’s responsibility to move it to a place, such as a general-purpose register or memory location, where it can be bound to a HLL variable.

BTW, if your goal is to implement 128-bit integer adds and multiplies, you might want to adopt the code I posted over at Stack Overflow a while back in the thread “128 bit integer on cuda?”.