I am using the CUDA runtime API in my application. My kernel contains a statement of the form X = a*X + b, where X, a and b are unsigned ints. When I checked the generated PTX, it used mul.lo.u32 and add.u32 instead of a single mad.lo.u32 instruction. I tried inline assembly, but that introduces a lot of unnecessary extra mov instructions.
Can I force nvcc to use the mad instruction (with compiler directives, perhaps)?
Can I modify the PTX and update the executable? If so, how? (Without using the driver API.)
The GPU does not execute the instruction set described by PTX (although probably a similar one); there is a further compilation step that can perform certain optimizations when generating the machine code for a particular GPU. Not much has been disclosed about it, but ptx_isa_2.2.pdf states explicitly in Table 54, about floating-point operations: »In particular, mul/add and mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.« That might be true for integer mad too, since rounding mode is never an issue there.
AFAIK you can modify the PTX code; nvcc.pdf gives an overview of the whole compilation process. It is uncertain, however, whether doing so actually changes the resulting instruction stream.
The optimization of mul + add into mad happens after the PTX stage. If you want to see it, you have to disassemble the .cubin file with one of the available disassemblers: decuda (compute capability 1.x only), nv50dis/nvc0dis together with elfToCubin, or the official cuobjdump (cc 1.x only, unless you have the 4.0rc prerelease).
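To check this on your own setup, a minimal sketch along the following lines might help. The kernel name lcg_step, the file name mad_test.cu, the constants and the sm_20 target are all placeholders made up for illustration, and the commands in the comments assume a toolkit whose cuobjdump can disassemble code for your architecture. Which instruction actually shows up depends on the GPU and toolkit version, so treat it as something to verify rather than a guaranteed result.

```cuda
// mad_test.cu (hypothetical file name)
//
// 1) Look at the PTX; separate mul.lo.u32 / add.u32 may appear here:
//      nvcc -arch=sm_20 -ptx mad_test.cu -o mad_test.ptx
// 2) Look at the machine code the GPU really executes:
//      nvcc -arch=sm_20 -cubin mad_test.cu -o mad_test.cubin
//      cuobjdump -sass mad_test.cubin
//    and check whether the multiply and add were merged into an
//    integer multiply-add (for example an IMAD instruction).
#include <cstdio>
#include <cuda_runtime.h>

// Applies X = a*X + b to every element, as in the question.
__global__ void lcg_step(unsigned int *x, unsigned int a, unsigned int b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int X = x[i];
        X = a * X + b;   // the statement under investigation
        x[i] = X;
    }
}

int main()
{
    const int n = 256;
    unsigned int host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (unsigned int)i;

    unsigned int *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    // Arbitrary example constants; any a and b will do for inspecting the code.
    lcg_step<<<(n + 127) / 128, 128>>>(dev, 1664525u, 1013904223u, n);

    cudaMemcpy(host, dev, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("x[1] = %u\n", host[1]);
    return 0;
}
```

If the disassembly already shows a fused multiply-add, there is nothing to gain from forcing mad.lo.u32 at the PTX level.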