I am looking through the CUDA PTX docs and I see that there is a mad instruction, which multiplies two numbers together and adds in a third.
In my code, I access a uint64 (unsigned long long) array, which results in PTX code like the following:
mul.lo.u64 %rd11, %rd9, 8; //
add.u64 %rd12, %rd3, %rd11; //
Here the base pointer is in %rd3, and the byte offset ends up in %rd11. Wouldn't this be better done as a single mad instruction?
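For reference, PTX does define a mad.lo form for integer types, so in principle the two instructions above could collapse into one. A sketch of what that might look like, reusing the register numbers from the snippet above purely for illustration (this is not compiler output):

mad.lo.u64 %rd12, %rd9, 8, %rd3; // %rd12 = low 64 bits of (%rd9 * 8) + %rd3

Whether the compiler actually emits this, and whether it would be any faster on real hardware, is exactly the question.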
Hm, you would think so.
One thing to consider is that the generated PTX is not the final machine code. Disassembling the actual cubin with decuda may show something quite different. For example, PTX has an instruction for integer modulus, but the hardware has no such operation, so somewhere along the way it gets expanded into dozens of instructions, either in the C-to-PTX stage or in the PTX-to-cubin stage.
It would be useful to have a list of operations that are actually implemented on the various devices, and their latency and throughput. (Even if mad for u64 existed, it might be slower than shift and add.) Such a list may exist, but I’ve never seen it.
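On the shift-and-add point: since the multiplier here is 8, a power of two, the compiler (or ptxas) is free to lower the mul into a shift. A hedged sketch of that form, again reusing the registers from the original snippet for illustration only:

shl.b64 %rd11, %rd9, 3; // %rd9 * 8 via a left shift by 3
add.u64 %rd12, %rd3, %rd11; // base pointer + byte offset

If 64-bit multiplies are emulated on the hardware while shifts and adds are native, this two-instruction sequence could well beat a hypothetical 64-bit mad.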