When I generate PTX code from my *.cu source, I get instruction blocks like this:
    sub.s32 %r89, %r41, 1;
    mov.s32 %r90, %r89;
    mov.s32 %r91, %r90;
    sub.s32 %r92, %r43, 2;
    max.s32 %r93, %r92, %r91;
    mov.s32 %r43, %r93;
This code could be optimized further by dropping the redundant moves, like this:
    sub.s32 %r89, %r41, 1;
    sub.s32 %r92, %r43, 2;
    max.s32 %r43, %r92, %r89;
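To make the claimed equivalence concrete, here is a rough host-side C++ model of the two sequences. It is only a sketch: the function names `emitted` and `shortened` are made up for illustration, the variables simply mirror the PTX register names, and it is not my kernel source. It also assumes the subtractions do not overflow (PTX `sub.s32` wraps around, while signed overflow in C++ is undefined).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model of the PTX as emitted: one statement per instruction.
std::int32_t emitted(std::int32_t r41, std::int32_t r43) {
    std::int32_t r89 = r41 - 1;             // sub.s32 %r89, %r41, 1
    std::int32_t r90 = r89;                 // mov.s32 %r90, %r89
    std::int32_t r91 = r90;                 // mov.s32 %r91, %r90
    std::int32_t r92 = r43 - 2;             // sub.s32 %r92, %r43, 2
    std::int32_t r93 = std::max(r92, r91);  // max.s32 %r93, %r92, %r91
    r43 = r93;                              // mov.s32 %r43, %r93
    return r43;
}

// Hypothetical model of the shortened sequence: the moves are folded away.
std::int32_t shortened(std::int32_t r41, std::int32_t r43) {
    std::int32_t r89 = r41 - 1;             // sub.s32 %r89, %r41, 1
    std::int32_t r92 = r43 - 2;             // sub.s32 %r92, %r43, 2
    r43 = std::max(r92, r89);               // max.s32 %r43, %r92, %r89
    return r43;
}

// Returns true if both models agree on every input pair in [lo, hi].
bool same_result_in_range(std::int32_t lo, std::int32_t hi) {
    for (std::int32_t a = lo; a <= hi; ++a)
        for (std::int32_t b = lo; b <= hi; ++b)
            if (emitted(a, b) != shortened(a, b)) return false;
    return true;
}
```

Both functions reduce to `max(r43 - 2, r41 - 1)`, so the three `mov` instructions only copy values between registers and do not change the result.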
I believe the hardware is capable of doing this kind of optimization on the fly with low overhead, but I'm afraid my algorithm is losing performance because the generated assembly is not optimized.
Is there a compiler option that enables more optimizations? (I'm already using -O3.)
Is there some concept I'm missing about how the assembly is generated? (Maybe there is another optimization step after the --ptx output is produced.)
Is there any way to inline asm code into the C code? (I have already tried the asm("…;") directive, but it does not resolve C variables to PTX registers.)
Thanks in advance.