Hi again. Every time I come back here, my questions get tougher. I have a program where I multiply numbers of various sizes by a 64-bit number and take the high bits of the result. I recently figured out that, while [post="1111581"]__umul64hi may take 10 cycles on Fermi[/post], I should be able to multiply a 32-bit number left-shifted by 32 bits by a 64-bit number with only two multiply instructions. The first is a simple umulhi with an add, but the second I'm having trouble with:
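For reference, here is one way the two-multiply decomposition can be worked out, assuming "high bits" means the upper 64 bits of the 128-bit product. Split the 64-bit operand into 32-bit halves, b = b_hi * 2^32 + b_lo (b_hi and b_lo are names introduced here for illustration). The floor can absorb the integer term a * b_hi, and the final bound shows the sum always fits in 64 bits:

```latex
\left\lfloor \frac{(a \cdot 2^{32})\, b}{2^{64}} \right\rfloor
  = \left\lfloor \frac{a\, b}{2^{32}} \right\rfloor
  = a\, b_{hi} + \left\lfloor \frac{a\, b_{lo}}{2^{32}} \right\rfloor
  = a\, b_{hi} + \operatorname{umulhi}(a,\, b_{lo})

a\, b_{hi} + \operatorname{umulhi}(a,\, b_{lo})
  \le (2^{32}-1)^2 + (2^{32}-1) = 2^{64} - 2^{32} < 2^{64}
```

Read this way, one instruction is the 32-bit umulhi(a, b_lo) and the other is a single 32x32-to-64 multiply of a by b_hi with that result added in, which is what mad.wide.u32 is meant to provide.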

[codebox]__device__ uint64_t mad_wide_u32(const unsigned int a, const unsigned int b, const unsigned int c) {
    uint64_t res;
    asm("mad.wide.u32 %0, %1, %2, %3;" : "=r" (res) : "r" (a), "r" (b), "r" (c));
    return res;
}[/codebox]

Every time I try to compile with that, I get this error:

### Assertion failure at line 2025 of …/…/be/cg/NVISA/cgtarget.cxx:

### Compiler Error in file /tmp/tmpxft_00003f5c_00000000-7_appcu.cpp3.i during Code_Expansion phase:

### ASM operand does not satisfy its constraint r

So, first, what am I doing wrong there? More importantly, what can I do about it?
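A likely fix, offered as a sketch: in CUDA inline PTX the "r" constraint binds a 32-bit register and "l" binds a 64-bit one, so a 64-bit `res` bound with "=r" most likely triggers exactly this assertion. Because .wide makes both d and c 64-bit, `c` should also be a `uint64_t` bound with "l". The block below also pairs `__umulhi` with `mad.wide.u32` to get the high 64 bits of `(a << 32) * b`; `mul_shifted32_hi`, `mul_shifted32_hi_ref`, `high_bits_kernel`, and the test values are hypothetical, and the host reference assumes a host compiler that supports the GCC/Clang `unsigned __int128` extension.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>

// mad.wide.u32 d, a, b, c: d = a * b + c, with a, b 32-bit and c, d 64-bit.
// "r" binds a 32-bit register; "l" binds a 64-bit register.
__device__ uint64_t mad_wide_u32(unsigned int a, unsigned int b, uint64_t c) {
    uint64_t res;
    asm("mad.wide.u32 %0, %1, %2, %3;" : "=l"(res) : "r"(a), "r"(b), "l"(c));
    return res;
}

// Hypothetical helper: high 64 bits of the 128-bit product (a << 32) * b,
// using b = b_hi * 2^32 + b_lo.
__device__ uint64_t mul_shifted32_hi(unsigned int a, uint64_t b) {
    unsigned int b_lo = (unsigned int)b;
    unsigned int b_hi = (unsigned int)(b >> 32);
    unsigned int lo_hi = __umulhi(a, b_lo);          // high 32 bits of a * b_lo
    return mad_wide_u32(a, b_hi, (uint64_t)lo_hi);   // a * b_hi + lo_hi, fits in 64 bits
}

__global__ void high_bits_kernel(const unsigned int* a, const uint64_t* b, uint64_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mul_shifted32_hi(a[i], b[i]);
}

// Host-only reference built on the unsigned __int128 compiler extension.
uint64_t mul_shifted32_hi_ref(unsigned int a, uint64_t b) {
    unsigned __int128 p = ((unsigned __int128)a << 32) * b;
    return (uint64_t)(p >> 64);
}

int main() {
    const int n = 3;
    unsigned int ha[n] = {0xFFFFFFFFu, 3u, 1u};
    uint64_t hb[n] = {0xFFFFFFFFFFFFFFFFull, 0x100000000ull, 0x9ABCDEF012345678ull};
    uint64_t hout[n] = {0, 0, 0};
    unsigned int* da;
    uint64_t* db;
    uint64_t* dout;
    cudaMalloc(&da, sizeof ha);
    cudaMalloc(&db, sizeof hb);
    cudaMalloc(&dout, sizeof hout);
    cudaMemcpy(da, ha, sizeof ha, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof hb, cudaMemcpyHostToDevice);
    high_bits_kernel<<<1, 32>>>(da, db, dout, n);
    cudaMemcpy(hout, dout, sizeof hout, cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        uint64_t want = mul_shifted32_hi_ref(ha[i], hb[i]);
        printf("%08x * %016llx -> %016llx (ref %016llx)\n", ha[i],
               (unsigned long long)hb[i], (unsigned long long)hout[i],
               (unsigned long long)want);
        if (hout[i] != want) ++mismatches;
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dout);
    return mismatches;  // 0 when the GPU results match the host reference
}
```

The host reference is only there to check the device path; on the GPU side the intent is that the high half comes from one `__umulhi` plus one `mad.wide.u32`, though the exact SASS the compiler emits is worth confirming with cuobjdump.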

Second, can someone please verify or correct whatever is in parentheses in the following statement of what I believe the mad.wide.u32 instruction does? (Or anything else that's wrong?)

Edit: I answered one of my own questions. From the PTX ISA PDF: “If .wide is specified, then d **and c** are twice as wide as a and b to receive the result of the multiplication.” I've modified my statement below accordingly.

“I think that mad.wide.u32 takes (2 cycles) to multiply a 32-bit number by another 32-bit number, produce a 64-bit result, and then add a 64-bit number to the entirety of the 64-bit result.”
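As a sanity check on the arithmetic in that statement (not the cycle count), a host-side model might look like the block below. `mad_wide_u32_model` is a hypothetical name, and the modulo-2^64 wraparound on the final add is an assumption based on PTX integer mad not saturating in .wide mode. The full 32x32 product never overflows 64 bits, so any carry out of the low 32 bits lands in the high 32 bits, and only a large addend can wrap the sum.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical model of mad.wide.u32 semantics: the product of two 32-bit
// values is formed at full 64-bit width, then the 64-bit addend is added to
// all 64 bits. Unsigned C++ arithmetic wraps modulo 2^64.
__host__ __device__ uint64_t mad_wide_u32_model(uint32_t a, uint32_t b, uint64_t c) {
    return (uint64_t)a * b + c;
}

int main() {
    // Largest product: (2^32 - 1)^2 = 0xFFFFFFFE00000001, still below 2^64.
    printf("%llx\n", (unsigned long long)mad_wide_u32_model(0xFFFFFFFFu, 0xFFFFFFFFu, 0));
    // Carry out of the low 32 bits propagates into the high 32 bits.
    printf("%llx\n", (unsigned long long)mad_wide_u32_model(1u, 0xFFFFFFFFu, 1));
    // A large 64-bit addend wraps the sum.
    printf("%llx\n", (unsigned long long)mad_wide_u32_model(0xFFFFFFFFu, 0xFFFFFFFFu,
                                                           0xFFFFFFFFFFFFFFFFull));
    return 0;
}
```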

Thanks!