Assembly Optimization

Edans_Sandes · May 24, 2009, 2:44pm

Hello,

When I generate the PTX code from my *.cu source, I get instruction blocks like this:

    sub.s32 	%r89, %r41, 1;
    mov.s32 	%r90, %r89;
    mov.s32 	%r91, %r90;
    sub.s32 	%r92, %r43, 2;
    max.s32 	%r93, %r92, %r91;
    mov.s32 	%r43, %r93;

This code could be optimized further, simply like this:

    sub.s32 	%r89, %r41, 1;
    sub.s32 	%r92, %r43, 2;
    max.s32 	%r43, %r92, %r89;

I believe the hardware is capable of doing this kind of optimization on the fly, with low overhead. But I’m afraid that my algorithm is losing performance because the asm is not optimized.

Questions:

Is there any option that enables more optimizations? (I’m already using -O3.)

Is there any concept that I’m missing about the assembly generation? (Maybe an optimization step after --ptx generation?)

Is there any way of inlining asm code into the C code? (I have already tried the asm("…;") directive, but it does not resolve C variables to PTX registers.)

Thanks in advance.

_Big_Mac · May 24, 2009, 3:27pm

PTX is not the final assembly. Register reuse is done when the PTX gets translated to binary (.cubin). That happens either at build time, when you compile with a specific setting, or, if you compile only to PTX, in the driver at runtime.

PTX files are supposed to look like that; this is pre-optimization output.
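
To check what actually runs on the device, one option is to compare the PTX with the output of ptxas. The sketch below is only an illustration: the kernel name `max_step` and the file name `kernel.cu` are made up, and `cuobjdump` is assumed to be available (it ships with later toolkits than the one discussed in this thread).

```cuda
// Hypothetical kernel (names are illustrative, not from the original post).
// It reproduces the pattern from the question: max(b - 2, a - 1).
//
// Build steps one could try (the file name kernel.cu is an assumption):
//   nvcc -O3 -ptx   kernel.cu -o kernel.ptx     // virtual ISA, before ptxas
//   nvcc -O3 -cubin kernel.cu -o kernel.cubin   // runs ptxas on the PTX
//   cuobjdump -sass kernel.cubin                // disassemble the machine code
//   nvcc -O3 -Xptxas -v -c kernel.cu            // print ptxas register usage
__global__ void max_step(const int *a, int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // In the PTX this may show up with redundant mov instructions;
        // the disassembled binary is the place to look for the final code.
        b[i] = max(b[i] - 2, a[i] - 1);
    }
}
```

If the redundant moves are gone in the disassembly, the extra `mov.s32` lines in the PTX cost nothing at runtime.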

sunsetquest · May 25, 2009, 4:48pm

I’m not an expert, but I have played a moderate amount with PTX.

Is there any option that makes more optimizations? I think that (-O3) is about it. The compiler is smart, but not that smart, and PTX code can sometimes be improved by hand tuning. One upside of keeping the compiler’s code is that it will work in all cases (different processors, all data combinations, etc.), whereas hand-written code may only work for positive numbers or something like that.

Also, there are a few gotchas when writing code by hand. The compiler usually avoids immediate read-after-writes and also avoids some really slow PTX instructions. (Example: roughly 800 ticks for set.eq.and.f32.f32 f32, f32, f32, !Pdt; versus about 60 ticks for set.eq.and.f32.f32 f32, f32, f32, Pdt;) I find that hand-written code sometimes “looks” faster but isn’t. It’s always best to time your code using clock(). Make sure to time it on the GPU and not in device emulation.
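
A minimal sketch of timing a code section with clock() inside a kernel follows. The kernel name, buffer names, and the loop being measured are all made up for illustration. The compiler may move instructions across the clock() reads, so treat the result as a rough figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: times a small block of integer work with clock().
__global__ void timed_kernel(const int *in, int *out, long long *ticks, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    clock_t start = clock();            // per-multiprocessor cycle counter
    int v = in[i];
    for (int k = 0; k < 64; ++k)
        v = max(v - 2, in[i] - 1);      // the code being measured
    clock_t stop = clock();

    out[i] = v;                         // store the result so the work is kept
    if (i == 0)
        *ticks = (long long)(stop - start);  // counter may wrap on long runs
}

int main()
{
    const int n = 256;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = 0, *d_out = 0;
    long long *d_ticks = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_ticks, sizeof(long long));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    timed_kernel<<<1, n>>>(d_in, d_out, d_ticks, n);

    long long h_ticks = 0;
    // This copy also waits for the kernel to finish.
    cudaMemcpy(&h_ticks, d_ticks, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("elapsed ticks (thread 0): %lld\n", h_ticks);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_ticks);
    return 0;
}
```

Running the compiler-generated version and the hand-tuned version through the same harness gives a like-for-like comparison.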

Is there any way of inlining asm code into the C code? Not at this time. There are several posts on this to look at. You can compile your CUDA file to a PTX file and then edit that.
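
For readers on later toolkits: CUDA releases after this thread accept inline PTX with GCC-style operand constraints, which is what binds a C variable to a PTX register. The sketch below assumes such a toolkit; the helper name `max_step_asm` is made up for illustration.

```cuda
// Hypothetical device helper: computes max(b - 2, a - 1) with inline PTX.
// The "r" constraint maps a 32-bit integer C variable to a PTX register;
// "=r" marks an output operand.
__device__ int max_step_asm(int a, int b)
{
    int t1, t2, r;
    asm("sub.s32 %0, %1, 1;" : "=r"(t1) : "r"(a));             // t1 = a - 1
    asm("sub.s32 %0, %1, 2;" : "=r"(t2) : "r"(b));             // t2 = b - 2
    asm("max.s32 %0, %1, %2;" : "=r"(r) : "r"(t2), "r"(t1));   // r = max(t2, t1)
    return r;
}
```

Since ptxas still optimizes the result, this mainly helps when a specific PTX instruction is needed, not as a way to remove redundant moves.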

Hope this helps.
