Assembly Optimization

Hello,

When I generate the PTX code from my *.cu source, I get some instruction blocks like this:

    sub.s32     %r89, %r41, 1;
    mov.s32     %r90, %r89;
    mov.s32     %r91, %r90;
    sub.s32     %r92, %r43, 2;
    max.s32     %r93, %r92, %r91;
    mov.s32     %r43, %r93;

This code could be optimized further, simply like this:

    sub.s32     %r89, %r41, 1;
    sub.s32     %r92, %r43, 2;
    max.s32     %r43, %r92, %r89;

I believe the hardware is capable of doing this kind of optimization on the fly, with low overhead. But I’m afraid that my algorithm is losing performance because the asm is not optimized.

Questions:

Is there any option that makes more optimizations? (I’m using -O3 already.)

Is there any concept that I’m missing about the assembly generation? (maybe an optimization step after --ptx generation)

Is there any way of inlining asm code into the C code? (I have already tried the asm("…;") directive, but it does not resolve C variables to PTX registers.)

Thanks in advance.

PTX is not the final assembly. Register reuse is done when the PTX gets translated to a binary (.cubin). That happens either at build time, when you compile with a setting that produces a binary, or, if you compile only to PTX, in the driver at runtime.

PTX files are supposed to look like that; this is the code before that optimization step.
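To see what the translation step actually does with the redundant moves, one option is to compare the PTX with the disassembled binary. The sketch below is a guess at source that would produce a sub/sub/max pattern like the one in the question; the file name `maxupdate.cu` and the names `max_update` and `max_update_host` are made up for illustration, and it assumes a toolkit that ships `cuobjdump`.

```cuda
// Hypothetical file: maxupdate.cu
// Build and inspect (assuming nvcc and cuobjdump are on the PATH):
//   nvcc -O3 -ptx   maxupdate.cu -o maxupdate.ptx    // intermediate PTX
//   nvcc -O3 -cubin maxupdate.cu -o maxupdate.cubin  // binary for one GPU target
//   cuobjdump -sass maxupdate.cubin                  // disassembled machine code
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes b[i] = max(b[i] - 2, a[i] - 1).
__global__ void max_update(const int *a, int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        b[i] = max(b[i] - 2, a[i] - 1);
    }
}

// Runs the kernel on a single element (error checking omitted for brevity).
int max_update_host(int a, int b)
{
    int *d_a, *d_b;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
    max_update<<<1, 1>>>(d_a, d_b, 1);
    cudaMemcpy(&b, d_b, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return b;
}

int main()
{
    printf("%d\n", max_update_host(10, 5));
    return 0;
}
```

If the disassembly shows no leftover register-to-register moves, the extra `mov` instructions in the PTX cost nothing at run time.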

I’m not an expert but I have played a moderate amount with PTX.
Is there any option that makes more optimizations? I think -O3 is about it. The compiler is smart, but not that smart, and PTX code can sometimes be improved by hand tuning. One advantage of keeping the compiler’s code is that it will work in all cases (different processors, all data combinations, etc.), whereas hand-written code may only work for positive numbers or something like that.

There are also a few gotchas when writing code by hand. The compiler usually avoids immediate read-after-write dependencies and also avoids some really slow PTX instructions. (Example: about 800 ticks for set.eq.and.f32.f32 f32, f32, f32, !Pdt; but about 60 ticks for set.eq.and.f32.f32 f32, f32, f32, Pdt;) I find that hand-written code sometimes “looks” faster but isn’t. It’s always best to time your code using clock(). Make sure to time it on the GPU and not in device emulation.
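The clock() timing mentioned above can be sketched roughly as follows. This is a minimal illustration, not a rigorous benchmark: the names `timed_max` and `timed_max_host` are hypothetical, only one thread runs, the timed region includes the global memory accesses, and the compiler is free to schedule instructions around the clock() calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Times one max(b - 2, a - 1) update, including its loads and store,
// using the per-multiprocessor cycle counter read by clock().
__global__ void timed_max(const int *a, int *b, int *ticks)
{
    clock_t start = clock();
    int r = max(*b - 2, *a - 1);
    *b = r;
    clock_t stop = clock();
    *ticks = (int)(stop - start);  // elapsed cycles as seen by this thread
}

// Launches the kernel with a single thread (error checking omitted for brevity).
int timed_max_host(int a, int b, int *ticks_out)
{
    int *d_a, *d_b, *d_ticks;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_ticks, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
    timed_max<<<1, 1>>>(d_a, d_b, d_ticks);
    cudaMemcpy(&b, d_b, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(ticks_out, d_ticks, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_ticks);
    return b;
}

int main()
{
    int ticks = 0;
    int result = timed_max_host(10, 5, &ticks);
    printf("result=%d ticks=%d\n", result, ticks);
    return 0;
}
```

Timing the compiler-generated version and the hand-edited version the same way, over many repetitions, gives a fairer comparison than reading the PTX.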

Is there any way of inlining asm code into the C code? Not at this time. There are several posts on this to look at. You can compile your CUDA file to a PTX file and then edit that.
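Later CUDA toolkits than the one discussed in this thread document inline PTX with operand constraints, which is what binds a C variable to a PTX register. A minimal sketch, assuming such a toolkit; the names `max_update_asm`, `max_update_asm_kernel` and `max_update_asm_host` and the temporaries `t0` and `t1` are made up for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Computes max(b - 2, a - 1) with hand-written PTX.
// "=r" binds the output to a 32-bit register, "r" binds each input.
__device__ int max_update_asm(int a, int b)
{
    int r;
    asm("{\n\t"
        ".reg .s32 t0, t1;\n\t"      // scratch registers local to this block
        "sub.s32 t0, %1, 1;\n\t"     // t0 = a - 1
        "sub.s32 t1, %2, 2;\n\t"     // t1 = b - 2
        "max.s32 %0, t1, t0;\n\t"    // r  = max(t1, t0)
        "}"
        : "=r"(r)
        : "r"(a), "r"(b));
    return r;
}

__global__ void max_update_asm_kernel(int a, int b, int *out)
{
    *out = max_update_asm(a, b);
}

// Runs the kernel with one thread (error checking omitted for brevity).
int max_update_asm_host(int a, int b)
{
    int *d_out;
    int result = 0;
    cudaMalloc(&d_out, sizeof(int));
    max_update_asm_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}

int main()
{
    printf("%d\n", max_update_asm_host(10, 5));
    return 0;
}
```

Even with inline PTX, the final register allocation still happens when the PTX is translated to a binary, so it is worth timing the result against the plain C version.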

Hope this helps.