mullo and mulhi instructions

These two multiply instructions are usually architecture-dependent. How are they implemented in CUDA?

mullo(a,b) -> (a x b) mod 2^W
mulhi(a,b) -> |_ (a x b) / 2^W _|

where W is the processor’s word size in bits, 2^W is 2 raised to the power W, and |_ _| denotes the floor function.

In fact, I am having trouble understanding the concept of mullo and mulhi. Can someone point me to a place where I can read more about these types of instructions?

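CUDA follows C/C++ semantics, so an ordinary multiplication of two n-bit integers keeps only the low n bits of the result. The full product of two n-bit numbers occupies up to 2n bits: mul.lo returns the least-significant n bits, and mul.hi returns the most-significant n bits of that double-width product. In CUDA C, the plain * operator gives the low half, and the __umulhi() intrinsic gives the high half for 32-bit unsigned operands. Try this (it may also be interesting to examine the resulting PTX code):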

unsigned int a, b, prod_hi, prod_lo;
unsigned long long int prod;
a = 0x31415926;
b = 0x53589793;
prod_lo = a * b;                      // low 32 bits of the product (mul.lo)
prod_hi = __umulhi(a, b);             // high 32 bits of the product (mul.hi)
prod = (unsigned long long int)a * b; // full 64-bit product for comparison
printf ("prod_hi_lo = %08x_%08x  prod=%016llx\n", prod_hi, prod_lo, prod);
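
If you want to check the two halves on the host without a GPU, the same values can be computed in plain C from the 64-bit product. This is only a minimal sketch for the 32-bit unsigned case; the helper names mullo32 and mulhi32 are made up for illustration and are not CUDA functions.

```c
#include <stdint.h>

/* Hypothetical helper names for illustration; these are not CUDA functions. */

/* Low 32 bits of the 64-bit product: (a * b) mod 2^32. */
static uint32_t mullo32(uint32_t a, uint32_t b)
{
    uint64_t prod = (uint64_t)a * b;   /* widen first so no bits are lost */
    return (uint32_t)prod;             /* truncation keeps the low half */
}

/* High 32 bits of the 64-bit product: floor((a * b) / 2^32). */
static uint32_t mulhi32(uint32_t a, uint32_t b)
{
    uint64_t prod = (uint64_t)a * b;
    return (uint32_t)(prod >> 32);     /* shifting right discards the low half */
}
```

For example, 0xFFFFFFFF times 0xFFFFFFFF is 0xFFFFFFFE00000001, so mulhi32 returns 0xFFFFFFFE and mullo32 returns 0x00000001. Gluing the two halves back together always reproduces the full 64-bit product.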

Thanks! I will check the PTX code for sure.

Come to think of it, you may have to make ‘a’ and ‘b’ kernel arguments to see anything interesting in the PTX. Assigning literal constants to ‘a’ and ‘b’ allows the compiler to evaluate everything at compile time, so the multiplies may never show up in the generated code.