Integer multiplication: How to force nvcc to generate mul.hi?

Hello all,

I need to perform integer multiplication and am only interested in the high-order 32 bits of the result. Is there any way to force nvcc to generate mul.hi.u32 instead of cvt.u64.u32/mul.lo.u64/shr.u64? Or is the latter sequence perhaps faster?

And one more question: what is the native register size in G80/G84? Are 64-bit arithmetic and logical instructions (no FP) slower than 32-bit ones?

Hope someone knows the answers :)

Thanks in advance.

Ed: Okay, the functions which map to mul.hi are __mulhi() and __umulhi(), defined in device_functions.h. They are also listed in Appendix B.2 of the CUDA Programming Guide. The question about 64-bit ALU performance still remains.
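
For anyone landing here later, a minimal sketch of how the intrinsic might be wrapped. The helper name mulhi_u32, the MULHI_HD macro and the check function are made up for illustration; only __umulhi() comes from the post above. It is written so the same file should build as plain C++ (using the 64-bit widen/shift fallback) or under nvcc (using the intrinsic in device code). Whether the device path really ends up as mul.hi.u32 is worth confirming in the generated PTX.

```cpp
#include <cassert>
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#if defined(__CUDACC__)
#define MULHI_HD __host__ __device__
#else
#define MULHI_HD
#endif

// Hypothetical helper (name is illustrative): high 32 bits of the 64-bit product a * b.
MULHI_HD inline uint32_t mulhi_u32(uint32_t a, uint32_t b) {
#if defined(__CUDA_ARCH__)
    // Device pass: the intrinsic named in the post, expected to map to mul.hi.u32.
    return __umulhi(a, b);
#else
    // Host pass: widen, multiply, shift (the cvt / mul.lo.u64 / shr pattern from the question).
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
#endif
}

// Host-side sanity check of the fallback path.
inline void check_mulhi_u32() {
    assert(mulhi_u32(3u, 4u) == 0u);                            // product fits in 32 bits
    assert(mulhi_u32(0x80000000u, 2u) == 1u);                   // 2^31 * 2 = 2^32
    assert(mulhi_u32(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFEu); // 0xFFFFFFFE00000001 >> 32
}
```

Inside a kernel this would be called like any other device function, e.g. `out[i] = mulhi_u32(a[i], b[i]);`. For signed operands, __mulhi() is the counterpart mentioned above.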

On G8x, both the mul.lo.32 and mul.hi.32 instructions have a throughput of 16 clocks per warp, while mul24.lo and mul24.hi are faster, taking only 4 clocks per warp, just like any other ‘simple’ operation.
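
If both operands are known to fit in 24 bits (array indices often do), the 24-bit multiply can be reached through the __umul24() intrinsic. Below is a rough sketch under that assumption. The helper name mul24_lo_u32 and the MUL24_HD macro are made up for illustration, and the host branch is only an emulation of the low-32-bit result so the snippet can be tried without a GPU. As far as I know there is no CUDA intrinsic for the mul24.hi form, so only the .lo variant is shown.

```cpp
#include <cassert>
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#if defined(__CUDACC__)
#define MUL24_HD __host__ __device__
#else
#define MUL24_HD
#endif

// Hypothetical helper (name is illustrative): multiply the low 24 bits of a and b
// and return the low 32 bits of that product. The top 8 bits of each input are ignored.
MUL24_HD inline uint32_t mul24_lo_u32(uint32_t a, uint32_t b) {
#if defined(__CUDA_ARCH__)
    // Device pass: 24-bit multiply intrinsic.
    return __umul24(a, b);
#else
    // Host pass: emulate by masking to 24 bits; unsigned overflow wraps modulo 2^32.
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
#endif
}

// Host-side sanity check of the emulation path.
inline void check_mul24_lo_u32() {
    assert(mul24_lo_u32(6u, 7u) == 42u);
    assert(mul24_lo_u32(0x01000003u, 5u) == 15u);             // bit 24 of a is dropped
    assert(mul24_lo_u32(0xFFFFFFu, 0xFFFFFFu) == 0xFE000001u); // (2^24 - 1)^2 mod 2^32
}
```

The speed advantage quoted above is specific to G8x-class hardware, so it is worth measuring before relying on it elsewhere.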