XMAD meaning

SPWorley · March 11, 2017, 6:48am

I spent an hour or so trying to coax C or ptx to generate that 2-XMAD u16 x u32 multiply in output SASS.

No success.

Observations: in plain CUDA C, multiplying unsigned shorts gets you an optimal single-XMAD 16x16->16 bit multiply. But the compiler won’t recognize a .h1 high word opportunity when you >>16 the short initializer. I guess that is asking too much of ptxas, though it DOES recognize .h0 and .h1 opportunities using the IEEE paper u16 local variable assignment incantation.
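For anyone who wants to try the same experiment, here is a minimal sketch of the two source patterns I mean. The function names and the `HD` macro are my own placeholders, not from any library; the macro only lets the same file build with nvcc or a plain host compiler. Whether you actually see a single XMAD, or an .h1 operand, depends on the architecture and toolchain version, so check the output with `cuobjdump -sass` rather than taking my word for it.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder macro (an assumption of this sketch): marks functions as
// host+device under nvcc and expands to nothing for a host-only compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Both operands are declared 16-bit. This is the form reported above to
// compile to a single XMAD.
HD inline uint32_t mul_u16(uint16_t a, uint16_t b) {
    return static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
}

// The 16-bit operand is the high half of a 32-bit word, extracted with >>16.
// Per the observation above, this shift was not folded into an .h1 select.
HD inline uint32_t mul_high_half(uint32_t packed, uint16_t b) {
    uint16_t hi = static_cast<uint16_t>(packed >> 16);
    return mul_u16(hi, b);
}

// Host-side sanity check of the arithmetic only (says nothing about SASS).
inline void self_check_u16() {
    assert(mul_u16(3, 7) == 21u);
    assert(mul_high_half(0x00030000u, 7) == 21u);
}
```

The arithmetic is trivial; the only point is how the operand widths are spelled in the source, since that is the information the compiler gets to work with.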

C code with a 16-bit immediate, like d=a*12345+b, was translated into PTX mad.lo.s32 with an immediate as expected. And, happily, this was converted into the desired XMAD and XMAD.PSL pair in SASS!! So ptxas does know how to properly generate optimal 2-XMAD u16*u32 when it knows it’s u16*u32.
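To make it concrete why two XMADs are enough for that case, here is a rough arithmetic model. It assumes XMAD computes a 16x16 product plus a 32-bit addend, and that the .PSL modifier shifts the product left by 16 before the add; that is my reading of the SASS, not documented behavior. `xmad`, `xmad_psl`, and `mad_u32_imm16` are hypothetical names for this sketch. I use unsigned types because the low 32 bits of the result are the same for signed and unsigned operands.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder macro (an assumption of this sketch): host+device under nvcc,
// nothing for a host-only compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Model of a plain XMAD: 16-bit x 16-bit product plus a 32-bit addend,
// wrapping modulo 2^32.
HD inline uint32_t xmad(uint16_t a, uint16_t b, uint32_t c) {
    return static_cast<uint32_t>(a) * static_cast<uint32_t>(b) + c;
}

// Model of XMAD.PSL, ASSUMING .PSL means "shift the product left by 16".
HD inline uint32_t xmad_psl(uint16_t a, uint16_t b, uint32_t c) {
    return ((static_cast<uint32_t>(a) * static_cast<uint32_t>(b)) << 16) + c;
}

// d = a * imm + b with a 16-bit imm, built from two modeled XMADs.
HD inline uint32_t mad_u32_imm16(uint32_t a, uint16_t imm, uint32_t b) {
    uint16_t a_h0 = static_cast<uint16_t>(a & 0xFFFFu);  // low half of a
    uint16_t a_h1 = static_cast<uint16_t>(a >> 16);      // high half of a
    uint32_t lo = xmad(a_h0, imm, b);   // a.h0 * imm + b
    return xmad_psl(a_h1, imm, lo);     // ((a.h1 * imm) << 16) + lo
}

// Host-side check that the two-step form matches the ordinary expression.
inline void self_check_mad() {
    uint32_t a = 0x12345678u, b = 99u;
    assert(mad_u32_imm16(a, 12345, b) == a * 12345u + b);
}
```

The high half of the immediate is zero, so the a.h0 * imm.h1 and a.h1 * imm.h1 terms never appear, which is where the saving over the general case comes from.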

ptxas acts very straightforwardly translating PTX to SASS, generating an optimal 1, 2, or 3 XMAD result for each PTX mul or mad, given the information the single PTX line conveys. But PTX itself is not descriptive enough to annotate the extra information of “one register argument is 16 bit, one is 32 bit” in mul or mad. ptxas is able to notice the u16 itself when given a 16-bit immediate (and generates optimal 2-XMAD SASS), but not with a u32 register argument that is clearly holding only a u16 value by initialization.
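The 2 versus 3 count falls out of the half-word arithmetic, which can be sketched like this. This models only the math (how many 16x16 products are needed for the low 32 bits), not the actual SASS sequence ptxas emits; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder macro (an assumption of this sketch): host+device under nvcc,
// nothing for a host-only compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// General u32 x u32 -> low 32 bits: three 16x16 products are needed.
// The a1*b1 term only affects bits 32 and up, so it is dropped.
HD inline uint32_t mul_lo_three_products(uint32_t a, uint32_t b) {
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;
    uint32_t cross = a0 * b1 + a1 * b0;   // two products
    return a0 * b0 + (cross << 16);       // third product
}

// u32 x u16 -> low 32 bits: if the second operand is known to fit in
// 16 bits, its high half is zero and one cross product disappears.
// Only valid when b16 < 65536; the compiler would have to prove that.
HD inline uint32_t mul_lo_two_products(uint32_t a, uint32_t b16) {
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    return a0 * b16 + ((a1 * b16) << 16);
}

// Host-side check that both forms agree with the built-in multiply.
inline void self_check_products() {
    assert(mul_lo_three_products(0xDEADBEEFu, 0x12345678u)
           == 0xDEADBEEFu * 0x12345678u);
    assert(mul_lo_two_products(0xDEADBEEFu, 0xBEEFu)
           == 0xDEADBEEFu * 0xBEEFu);
}
```

The two-product form is only correct under the `b16 < 65536` precondition, and that precondition is exactly the piece of information a single PTX mul or mad on u32 registers has no way to carry.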

So, Scott, you were correct. We can’t do this in PTX since PTX isn’t descriptive enough and ptxas doesn’t try to track values across multiple PTX statements to understand when an argument is 16 bit. The significant flaw in my hypothesis is that ptxas DOES successfully analyze multiple lines to notice the u32->u16 PTX local variables from the IEEE paper method to track .h0 and .h1 opportunities.
