When the NVCC compiler eliminates dead code, it seems to not consider the carry bit when it’s used across multiple inline-assembly blocks. Is this intended behavior?
It’s an issue in my code because I’m doing extended precision arithmetic, and in parts of my code I only need the most significant bits of the result. For example, when multiplying two 128-bit integers (consisting of 4 32-bit words each) the full result would be 256 bits (8 words). In places I only need the top 128 bits (4 words) of the result. However, some of the calculations that produce the lower 128 bits still need to be performed so that any carry bit values are propagated correctly.
Here’s a simple example:
__global__ void carryBitKernel(unsigned *a,
                               unsigned *b,
                               unsigned *result) {
    unsigned tmp0, tmp1;
    asm("add.cc.u32 %0, %1, %2;"   // add with carry-out
        : "=r"(tmp0)
        : "r"(a[0]), "r"(b[0]));
    asm("addc.u32 %0, %1, %2;"     // add with carry-in
        : "=r"(tmp1)
        : "r"(a[1]), "r"(b[1]));
    result[1] = tmp1;
}
Since tmp0 is never used, the first add instruction is eliminated as dead code, and the carry bit on which the second add instruction relies is never set. (I observed this behavior with nvcc versions 4.2 and 5.0, generating code for -arch=sm_20 and -arch=sm_30.)
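One workaround that should at least keep the first instruction alive is to mark both statements `asm volatile`, which tells the compiler the asm has side effects it cannot see, so it must not delete it even though tmp0 is unused. A sketch (not verified against every nvcc version, and note the caveat in the comment):

```
// Sketch: volatile keeps both asm statements alive under dead-code
// elimination.  However, nothing guarantees the compiler will not
// schedule a condition-code-clobbering instruction *between* the two
// statements, so the single-asm-block form is still the safer fix.
__global__ void carryBitVolatileKernel(unsigned *a,
                                       unsigned *b,
                                       unsigned *result) {
    unsigned tmp0, tmp1;
    asm volatile("add.cc.u32 %0, %1, %2;"   // carry-out, kept alive
                 : "=r"(tmp0)
                 : "r"(a[0]), "r"(b[0]));
    asm volatile("addc.u32 %0, %1, %2;"     // consumes the carry
                 : "=r"(tmp1)
                 : "r"(a[1]), "r"(b[1]));
    result[1] = tmp1;
}
```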
Combining the instructions into one ASM block fixes the problem:
asm("add.cc.u32 %0, %2, %4;"
"addc.u32 %1, %3, %5;"
: "=r"(tmp0), "=r"(tmp1)
: "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]));
result[1] = tmp1;
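For completeness, the same single-block approach scales to the full four-word case described above; a sketch (the "\n\t" separators are the convention from NVIDIA's inline-PTX examples and keep the generated PTX readable):

```
// Sketch: a full 128-bit add in one asm block, so the condition code
// is set and consumed without the compiler ever seeing a statement
// boundary it could optimize across.
__device__ void add128(const unsigned a[4], const unsigned b[4],
                       unsigned sum[4]) {
    asm("add.cc.u32  %0, %4,  %8;\n\t"   // word 0: carry-out only
        "addc.cc.u32 %1, %5,  %9;\n\t"   // words 1-2: carry-in + carry-out
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;"       // word 3: carry-in only
        : "=r"(sum[0]), "=r"(sum[1]), "=r"(sum[2]), "=r"(sum[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]));
}
```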
However, it would make the code much more flexible if each assembly instruction could be wrapped with a C function. For example:
dest[0] = addCarryOut(a[0], b[0]);
for (int i = 1; i < n - 1; i++)
    dest[i] = addCarryInOut(a[i], b[i]);
dest[n-1] = addCarryIn(a[n-1], b[n-1]);
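The wrappers I have in mind would look something like this (the names and signatures are mine). Each one compiles to a single PTX instruction, but because each call site becomes its own asm statement, they run into exactly the dead-code and carry-preservation problem described above:

```
// Hypothetical wrappers -- each is one PTX instruction, but each call
// expands to a separate asm statement, so the carry flag is not
// guaranteed to survive from one call to the next.
__device__ unsigned addCarryOut(unsigned a, unsigned b) {
    unsigned r;
    asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
    return r;
}

__device__ unsigned addCarryInOut(unsigned a, unsigned b) {
    unsigned r;
    asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
    return r;
}

__device__ unsigned addCarryIn(unsigned a, unsigned b) {
    unsigned r;
    asm volatile("addc.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
    return r;
}
```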