PTX u32 wide multiplication: how-to and performance characteristics?

Ken_g6 · October 10, 2010, 4:15pm

Hi again. Every time I come back here, my questions get tougher. I have a program that multiplies numbers of various sizes by a 64-bit number and keeps the high bits of the result. I recently figured out that, while __umul64hi may take 10 cycles on Fermi (see post 1111581), I should be able to multiply a 32-bit number left-shifted by 32 bits by a 64-bit number with only two multiply instructions. The first is a simple umulhi with an add, but the second I’m having trouble with:

[codebox]__device__ uint64_t mad_wide_u32(const unsigned int a, const unsigned int b, const unsigned int c) {
    uint64_t res;
    asm("mad.wide.u32 %0, %1, %2, %3;" : "=r" (res) : "r" (a), "r" (b), "r" (c));
    return res;
}[/codebox]

Every time I try to compile with that, I get this error:

Assertion failure at line 2025 of …/…/be/cg/NVISA/cgtarget.cxx:

Compiler Error in file /tmp/tmpxft_00003f5c_00000000-7_appcu.cpp3.i during Code_Expansion phase:

ASM operand does not satisfy its constraint r

So, first, what am I doing wrong there? More importantly, what can I do about it?

Second, can someone please verify or fix the things in parentheses in the following statement of what I believe the mad.wide.u32 instruction does? (Or anything else that’s wrong?)

Edit: Answered one of my own questions: from the PTX ISA PDF: “If .wide is specified, then d and c are twice as wide as a and b to receive the result of the multiplication.” So my statement below is modified.

“I think that mad.wide.u32 takes (2 cycles) to multiply a 32-bit number by another 32-bit number, produce a 64-bit result, and then add a 64-bit number to the entirety of the 64-bit result.”
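To make the arithmetic in that statement concrete, here is a minimal host-side sketch of what I understand mad.wide.u32 to compute, going by the PTX ISA sentence quoted above. The name mad_wide_u32_ref is made up for illustration, and this only models the result, not the timing. The wrap-around on overflow is ordinary unsigned 64-bit behavior in C++, which I am assuming matches the instruction.

```cpp
#include <cstdint>

// Hypothetical reference model (not an NVIDIA API): d = a * b + c,
// where a and b are 32-bit and c and d are 64-bit.
uint64_t mad_wide_u32_ref(uint32_t a, uint32_t b, uint64_t c) {
    // Widen one operand first so the full 64-bit product is kept;
    // the 64-bit add then wraps modulo 2^64 if it overflows.
    return static_cast<uint64_t>(a) * b + c;
}
```

Note that c is a full 64-bit addend here, which is why a 32-bit register cannot hold it.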

Thanks!

Ken_g6 · October 10, 2010, 11:42pm

Aha! A little (lot) more Googling finally turned up the answer to the first problem. For a 64-bit operand, it appears I need to use the constraint “l” instead of “r”. :) Source: Implementation of Multiple-precision Modular Multiplication on GPU - Kaiyong Zhao (PDF)

So here’s my little mad.wide.u32 function that at least compiles:

[codebox]// Clobbers c to save two registers and to make the example I found work.
__device__ void mad_wide_u32(const unsigned int a, const unsigned int b, uint64_t &c) {
    asm("mad.wide.u32 %0, %1, %2, %0;" : "+l" (c) : "r" (a), "r" (b));
}[/codebox]
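For comparison, a non-clobbering variant might look like the sketch below: a separate "=l" output and an "l" input for the 64-bit addend, with "r" kept for the two 32-bit factors. The names HOST_DEVICE and mad_wide_u32_sketch are made up for illustration, and the plain C++ fallback branch is my own addition so the same file also builds and runs with an ordinary host compiler. I have not measured whether this costs extra registers compared to the in-place version.

```cpp
#include <cstdint>

// Hypothetical helper macro: expands to nothing when not compiled by nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Returns a * b + c without modifying c.
HOST_DEVICE inline uint64_t mad_wide_u32_sketch(uint32_t a, uint32_t b, uint64_t c) {
#ifdef __CUDA_ARCH__
    // Device path: "l" binds 64-bit registers, "r" binds 32-bit registers.
    uint64_t d;
    asm("mad.wide.u32 %0, %1, %2, %3;" : "=l"(d) : "r"(a), "r"(b), "l"(c));
    return d;
#else
    // Host path (assumption: same result as the PTX instruction).
    return static_cast<uint64_t>(a) * b + c;
#endif
}
```

Called from host code, mad_wide_u32_sketch(0xFFFFFFFFu, 2u, 5) goes through the fallback branch; in device code the inline PTX is used instead.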

Whether it takes two cycles, one cycle, or four cycles, I don’t know. I’m going to assume it takes two unless someone tells me otherwise. :rolleyes:
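Going back to the original goal from the first post (the high 64 bits of a 32-bit number shifted left by 32, times a 64-bit number), here is a host-side sketch of the two-multiply decomposition as I read it. Splitting b into 32-bit halves gives high64((a << 32) * b) = a * b_hi + umulhi(a, b_lo), so one umulhi feeds one wide multiply-add. The names umulhi_ref and mulhi_shifted_ref are made up for illustration; on the device the two steps would be __umulhi and the mad.wide.u32 wrapper.

```cpp
#include <cstdint>

// Hypothetical host stand-in for __umulhi: high 32 bits of a 32x32 product.
static uint32_t umulhi_ref(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}

// High 64 bits of the 128-bit product (a * 2^32) * b.
uint64_t mulhi_shifted_ref(uint32_t a, uint64_t b) {
    uint32_t b_lo = static_cast<uint32_t>(b);        // low half of b
    uint32_t b_hi = static_cast<uint32_t>(b >> 32);  // high half of b
    // First multiply: only the high word of a * b_lo survives the shift.
    uint64_t carry = umulhi_ref(a, b_lo);
    // Second multiply: the step mad.wide.u32 would do, a * b_hi + carry.
    return static_cast<uint64_t>(a) * b_hi + carry;
}
```

The sum cannot overflow 64 bits: a * b_hi is at most 2^64 - 2^33 + 1 and the carry is at most 2^32 - 2.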

Ken_Domino · October 11, 2010, 4:19pm

Slightly off topic, but I’ve never seen this inline PTX asm directive before, as I usually write the entire routine in PTX, then use the CUDA driver API to call it. Inline PTX (asm) is not documented in any of the docs in the CUDA GPU Computing Toolkit. Searching the forums led me to this thread: http://forums.nvidia.com/index.php?showtop...1666&hl=asm, where it was recommended not to use it. That was about a year ago. I’m wondering if it’s a feature I can now use?

wwa · October 12, 2010, 7:33pm

Wow, someone actually won the fight with the compiler. Nice!
Personally I gave up trying long ago, and I basically code in PTX using Perl as a preprocessor and cudasm as a backend.
Long live low-level programming! Compilers never know enough.
