I spent an hour or so trying to coax CUDA C or PTX into generating that 2-XMAD u16 x u32 multiply in the output SASS.
No success.
Observations: using plain CUDA C, you get optimal single-XMAD 16x16->16 bit multiplies just by multiplying unsigned shorts. But the compiler won't recognize a .h1 high-word opportunity when you initialize the short with a >>16. I guess that is asking too much of ptxas, though it DOES recognize .h0 and .h1 opportunities when you use the u16 local variable assignment incantation from the IEEE paper.
C code with a 16-bit immediate, like d=a*12345+b, was translated into PTX mad.lo.s32 with an immediate, as expected. And, happily, this was converted into the desired XMAD and XMAD.PSL pair in SASS!! So ptxas does know how to generate the optimal 2-XMAD u16*u32 when it knows it's u16*u32.
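For reference, the three source patterns described above look roughly like this. It's a sketch in plain C so the arithmetic can be checked on the host. The function names are made up for illustration, I use unsigned types to avoid signed-overflow issues (the PTX I saw was mad.lo.s32), and nothing here guarantees what SASS a given toolchain will emit.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. The same bodies could be
   marked __device__ and compiled with nvcc to inspect the PTX and SASS. */

/* Pattern 1: both operands are unsigned shorts. */
uint32_t mul_u16_u16(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* Pattern 2: the short is initialized from the high half of a u32 via >>16.
   This is the case where I did not see a .h1 operand being used. */
uint32_t mul_hi16_u16(uint32_t x, uint16_t b) {
    uint16_t hi = (uint16_t)(x >> 16);
    return (uint32_t)hi * (uint32_t)b;
}

/* Pattern 3: multiply-add with a 16-bit immediate, d = a*12345 + b. */
uint32_t mad_imm16(uint32_t a, uint32_t b) {
    return a * 12345u + b;
}
```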
ptxas translates PTX to SASS very straightforwardly, generating an optimal 1-, 2-, or 3-XMAD result for each PTX mul or mad, given the information that single PTX line conveys. But PTX itself is not descriptive enough to annotate a mul or mad with the extra information that "one register argument is 16 bit, one is 32 bit". ptxas notices the u16 on its own when given a 16-bit immediate (and generates the optimal 2-XMAD SASS), but not when given a u32 register argument that clearly holds only a u16 value by initialization.
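To make the "2-XMAD" arithmetic concrete, here is a minimal sketch of how a u16 x u32 multiply-add splits into two 16x16 partial products, the second one shifted left by 16. It only models the math modulo 2^32. The function names are hypothetical, and I'm not claiming this matches the exact operand modes of XMAD or XMAD.PSL.

```c
#include <stdint.h>

/* (a * b + c) mod 2^32, where a fits in 16 bits and b is a full 32 bits.
   Hypothetical helper that mirrors the two-step structure. */
uint32_t mad_u16_u32(uint16_t a, uint32_t b, uint32_t c) {
    /* Step 1: a times the low 16 bits of b, plus the addend. */
    uint32_t lo = (uint32_t)a * (b & 0xFFFFu) + c;
    /* Step 2: a times the high 16 bits of b; shift the product left
       by 16 and accumulate it onto the result of step 1. */
    uint32_t hi = (uint32_t)a * (b >> 16);
    return (hi << 16) + lo;
}

/* Straightforward 32-bit multiply-add to compare against. */
uint32_t mad_reference(uint16_t a, uint32_t b, uint32_t c) {
    return (uint32_t)a * b + c;
}
```

The split only works with two products because a has no high half. With two full 32-bit operands you need a third partial product, which is where the 3-XMAD case comes from.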
So, Scott, you were correct. We can't do this in PTX, since PTX isn't descriptive enough and ptxas doesn't try to track values across multiple PTX statements to work out when an argument is 16 bit. The significant flaw in my hypothesis is that ptxas DOES successfully analyze multiple lines in one case: it notices the u32->u16 PTX local variables from the IEEE paper method and uses them to track .h0 and .h1 opportunities.