I need to perform an integer multiplication and am only interested in the high-order 32 bits of the result. Is there any way to force nvcc to generate mul.hi.u32 instead of the cvt.u64.u32/mul.lo.u64/shr.u64 sequence? Or does the latter perhaps run faster?
And one more question: what is the native register size in G80/G84? Are 64-bit arithmetic and logical instructions (integer only, no FP) slower than 32-bit ones?
Hope someone knows the answers :)
Thanks in advance.
Edit: Okay, the functions that map to mul.hi are __mulhi() and __umulhi(), defined in device_functions.h. They are also listed in Appendix B.2 of the CUDA Programming Guide. The question about 64-bit ALU performance still remains.
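
For anyone finding this later, here is a minimal sketch of how the two intrinsics could be wrapped. The helper names mul_hi_u32, mul_hi_s32 and mul_hi_kernel are my own, not part of CUDA. In device code the wrappers call __umulhi()/__mulhi(); in host code they fall back to the widen, multiply, shift approach from the question, so the same source also builds with a plain C++ compiler. Whether nvcc actually emits mul.hi.u32 for a given target is something I would check in the generated PTX rather than assume.

```cpp
#include <cstdint>

// Let the same helpers compile under nvcc and under a plain C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: high 32 bits of an unsigned 32x32-bit product.
HD inline uint32_t mul_hi_u32(uint32_t a, uint32_t b)
{
#ifdef __CUDA_ARCH__
    // Device path: the intrinsic named in the post.
    return __umulhi(a, b);
#else
    // Host path: widen to 64 bits, multiply, keep the upper half.
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
#endif
}

// Hypothetical helper: high 32 bits of a signed 32x32-bit product.
HD inline int32_t mul_hi_s32(int32_t a, int32_t b)
{
#ifdef __CUDA_ARCH__
    return __mulhi(a, b);
#else
    // Assumes >> on a negative int64_t is an arithmetic shift
    // (guaranteed from C++20, and what common compilers do before that).
    return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> 32);
#endif
}

#ifdef __CUDACC__
// Hypothetical kernel: element-wise high product of two arrays of length n.
__global__ void mul_hi_kernel(const uint32_t* a, const uint32_t* b,
                              uint32_t* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = mul_hi_u32(a[i], b[i]);
    }
}
#endif
```

Compiling with nvcc -ptx and searching the output for mul.hi should show which instruction sequence was chosen for the device path.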