Enforcing mad24 instructions
Any way to force the compiler to fuse subsequent mul + add into mad24?


I wonder if anyone else has faced this problem before…

Perhaps this is a bit esoteric at first glance…
but it matters if one uses integer arithmetic extensively in a kernel and wants to squeeze out some extra throughput.

By default the compiler replaces multiplications by powers of 2 with shifts, which might not
always be desirable. Consider the example:

__umul24(a, 2) + b, which results in 2 operations (a shift and an add),
while __umul24(a, 3) + b fuses into a single mad24 instruction… grr
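
To reproduce the comparison, something like the following pair of kernels could be compiled with `nvcc -ptx` and the two outputs compared. This is only a sketch: the kernel and parameter names are made up for illustration, and whether the second kernel really ends up as a single mad24 depends on the compiler version and target architecture.

```cuda
// Hypothetical test kernels (names are illustrative, not from the original post).
// Compile with `nvcc -ptx` and compare the code generated for each one.

__global__ void mul_by_two(const unsigned *a, const unsigned *b, unsigned *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __umul24(a[i], 2) + b[i];  // power-of-two multiplier: reported to become shift + add
}

__global__ void mul_by_three(const unsigned *a, const unsigned *b, unsigned *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __umul24(a[i], 3) + b[i];  // non-power-of-two multiplier: reported to fuse into mad24
}
```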

Isn’t there any way to force the compiler to use mad24 instead?

Although the “heavy weapon” would be to add a mad24 intrinsic manually to the nvopencc sources… but this is a headache.


My first thought would be to put the power of 2 in a variable…

Indeed! How stupid of me ;)

This definitely works for constant and shared memory…
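
A minimal sketch of the constant-memory variant, assuming the multiplier is set from the host before launch. The names `k_mul`, `scale_add_const` and `set_multiplier` are hypothetical; the point is only that the compiler cannot see the value at compile time, so it has no chance to turn the multiply into a shift.

```cuda
#include <cuda_runtime.h>

// Hypothetical constant-memory multiplier; its value is unknown at compile time.
__constant__ unsigned k_mul;

__global__ void scale_add_const(const unsigned *z, const unsigned *w, unsigned *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __umul24(z[i], k_mul) + w[i];  // multiplier is opaque, so no shift substitution
}

// Hypothetical host helper: copies the multiplier (e.g. 2) into constant memory.
cudaError_t set_multiplier(unsigned m)
{
    return cudaMemcpyToSymbol(k_mul, &m, sizeof(m));
}
```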

As for registers, one has to fool the optimizer to keep it from substituting the register with its value, for instance as follows (if you need lots of power-of-two muls + adds and can sacrifice 1 register for that):

volatile unsigned x = (threadIdx.x >> 16) + 2; // voila! x is always 2 (threadIdx.x >> 16 is 0), but the compiler is unable to simplify this

y = __umul24(z, x) + w; // gets replaced by mad24!
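
For anyone who wants to try it, here is the same trick wrapped into a small self-contained program. It is a sketch under assumptions: the kernel name, array sizes and launch configuration are arbitrary, error checking is omitted for brevity, and the host check only confirms the arithmetic result, not which instruction was emitted (inspect the PTX for that).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: computes out[i] = z[i] * 2 + w[i] with a "hidden" multiplier.
__global__ void scale_add(const unsigned *z, const unsigned *w, unsigned *out, int n)
{
    // threadIdx.x >> 16 is always 0, so x == 2 at run time,
    // but volatile keeps the optimizer from treating x as a known constant.
    volatile unsigned x = (threadIdx.x >> 16) + 2;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __umul24(z[i], x) + w[i];
}

int main()
{
    const int n = 256;
    unsigned hz[n], hw[n], hout[n];
    for (int i = 0; i < n; ++i) { hz[i] = i; hw[i] = 1000u + i; }

    unsigned *dz, *dw, *dout;
    cudaMalloc((void **)&dz, n * sizeof(unsigned));
    cudaMalloc((void **)&dw, n * sizeof(unsigned));
    cudaMalloc((void **)&dout, n * sizeof(unsigned));
    cudaMemcpy(dz, hz, n * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dw, hw, n * sizeof(unsigned), cudaMemcpyHostToDevice);

    scale_add<<<(n + 127) / 128, 128>>>(dz, dw, dout, n);
    cudaMemcpy(hout, dout, n * sizeof(unsigned), cudaMemcpyDeviceToHost);

    // Compare against the plain host computation.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (hout[i] != hz[i] * 2u + hw[i]) ++errors;
    printf("%d mismatches\n", errors);

    cudaFree(dz); cudaFree(dw); cudaFree(dout);
    return errors != 0;
}
```

The cost is the one extra register for x, as mentioned above; the multiplier is computed once per thread and can then be reused for every power-of-two mul + add in the kernel.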