# PTX u32 wide multiplication: how-to and performance characteristics?

Hi, again. Every time I come back here, my questions get tougher. I have a program where I multiply numbers of various sizes by a 64-bit number and take the high bits of the result. I recently figured out that, while `__umul64hi` may take 10 cycles on Fermi (see post 1111581), I should be able to multiply a 32-bit number left-shifted by 32 bits by a 64-bit number with only two multiply instructions. The first is a simple umulhi with an add, but the second I’m having trouble with:

[codebox]__device__ uint64_t mad_wide_u32(const unsigned int a, const unsigned int b, const unsigned int c) {
    uint64_t res;
    asm("mad.wide.u32 %0, %1, %2, %3;" : "=r"(res) : "r"(a), "r"(b), "r"(c));
    return res;
}[/codebox]

Every time I try to compile with that, I get this error:

`ASM operand does not satisfy its constraint r`

So, first, what am I doing wrong there? More importantly, what can I do about it?

Second, can someone please verify or fix the things in parentheses in the following statement of what I believe the mad.wide.u32 instruction does? (Or anything else that’s wrong?)

Edit: Answered one of my own questions: from the PTX ISA PDF: “If .wide is specified, then d and c are twice as wide as a and b to receive the result of the multiplication.” So my statement below is modified.

“I think that mad.wide.u32 takes (2 cycles) to multiply a 32-bit number by another 32-bit number, produce a 64-bit result, and then add a 64-bit number to the entirety of the 64-bit result.”
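For reference, the arithmetic in that statement can be modeled on the host in plain C++. This is only a sketch of the semantics quoted from the PTX ISA above (a 32x32 multiply widened to 64 bits, then a 64-bit add); it says nothing about the cycle count. The name `mad_wide_u32_ref` is hypothetical, and the assumption that the final add wraps modulo 2^64 follows from ordinary unsigned integer arithmetic.

```cpp
#include <cstdint>

// Host-side reference model of "mad.wide.u32 d, a, b, c":
//   d = (64-bit product of a and b) + c
// mad_wide_u32_ref is a hypothetical helper name used only for illustration.
inline std::uint64_t mad_wide_u32_ref(std::uint32_t a, std::uint32_t b, std::uint64_t c) {
    // Widen before multiplying so the full 64-bit product is kept.
    const std::uint64_t product = static_cast<std::uint64_t>(a) * b;
    // Unsigned addition wraps modulo 2^64 if the sum overflows.
    return product + c;
}
```

Comparing a device result against a model like this for a handful of inputs is a cheap way to confirm that an inline asm wrapper has its operands in the right order.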

Thanks!

Aha! A little (lot) more Googling finally turned up the answer to the first problem. For a 64-bit operand, it appears I need to use the constraint “l” instead of “r”. :) Source: Implementation of Multiple-precision Modular Multiplication on GPU - Kaiyong Zhao (PDF)

So here’s my little mad.wide.u32 function that at least compiles:

[codebox]// Clobbers c to save two registers and to make the example I found work.
__device__ void mad_wide_u32(const unsigned int a, const unsigned int b, uint64_t &c) {
    // "+l" marks c as a 64-bit read-write operand; "r" marks a and b as 32-bit inputs.
    asm("mad.wide.u32 %0, %1, %2, %0;" : "+l"(c) : "r"(a), "r"(b));
}[/codebox]

Whether it takes two cycles, one cycle, or four cycles, I don’t know. I’m going to assume it takes two unless someone tells me otherwise.
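To check the two-instruction idea from the original question independently of the GPU, here is a host-side C++ sketch. It assumes the goal is the high 64 bits of the 128-bit product (a << 32) * b, which equals (a * b) >> 32, and that splits into a * b_hi + umulhi(a, b_lo). The sum cannot overflow 64 bits, because a * b_hi is at most 2^64 - 2^33 + 1 and the umulhi term is below 2^32. All function names here are hypothetical stand-ins, not CUDA intrinsics.

```cpp
#include <cstdint>

// Models a 32-bit umulhi: the upper 32 bits of the 64-bit product a * b.
inline std::uint32_t umulhi_ref(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) * b) >> 32);
}

// Models mad.wide.u32: 32x32 -> 64-bit multiply, then a 64-bit add.
inline std::uint64_t mad_wide_ref(std::uint32_t a, std::uint32_t b, std::uint64_t c) {
    return static_cast<std::uint64_t>(a) * b + c;
}

// High 64 bits of the 128-bit product (a << 32) * b, using two multiplies:
// one umulhi on the low half of b, one mad.wide on the high half.
inline std::uint64_t mulhi_shifted_ref(std::uint32_t a, std::uint64_t b) {
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);
    return mad_wide_ref(a, b_hi, umulhi_ref(a, b_lo));
}
```

On the device, the same shape would presumably be one `__umulhi` call feeding the `c` operand of the wrapper above, but that mapping is my assumption and worth verifying against this model.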

Slightly off topic, but I’ve never seen this inline PTX asm directive before, as I usually write the entire routine in PTX, then use the CUDA driver API to call it. Inline PTX (asm) is not documented in any of the docs in the CUDA GPU Computing Toolkit. Searching the forums led me to this: http://forums.nvidia.com/index.php?showtop…1666&hl=asm, where it was recommended not to use it. That was about a year ago. I’m wondering if it’s a feature I can now use?

Wow, someone actually won the fight with the compiler. Nice!
Personally, I gave up trying long ago, and I basically code in PTX using Perl as a preprocessor and cudasm as a backend.
Long live low-level programming! Compilers never know enough.
