PTX assembly language reference: does one exist, or are there plans to release one?

jhanweck · March 25, 2007, 1:56am

Are there plans to release a PTX assembly language manual? Or, does one exist somewhere?

It would be helpful for general debugging… Also, I can see writing assembly to hand-optimize register use, and so forth.

That said, would it be possible to allow “inlined” assembly in CUDA code? As device functions, perhaps? (I seem to recall some old PC C compilers – maybe Turbo C? – allowed something like this.)

Stewie · March 26, 2007, 3:08pm

I believe the .ptx file is code from before the final optimization pass, so I'm not sure that hand-optimizing the register usage there would have an impact.
(e.g., I wrote some code that used 100 registers, but the final .cubin file says it only used 21)

Mark_Harris · March 27, 2007, 1:11pm

Stewie is correct. PTX is a virtual machine assembly language, so it is not optimized for the exact hardware instruction set or register usage. This way we can target different GPUs with the same assembly-level interface. The code is then further optimized at load time by the GPU driver.

Mark

jhanweck · March 28, 2007, 2:45pm

Interesting… good to know.

Still, a PTX cheat sheet would be nice to have (e.g., for debugging).

For example: does mul.lo.u32 multiply (unsigned) the low 16 bits of the operands together into a 32-bit word?

The snippet below generates an index into a global memory array of ints, parm_output. The index is threadIdx.x + blockIdx.x * blockDim.x.

$r15 gets the product blockIdx.x * blockDim.x using mul.lo.u32, which makes sense because both operands fit in 16 bits (the grid dimensions are limited to 65535).

$r16 gets the final index, adding threadIdx.x to the product above. This value could very well be larger than 2^16.

$r17 gets the offset into global memory, again using mul.lo.u32 to multiply the index ($r16) by sizeof(int) (4).

But since $r16 could be greater than 2^16, is mul.lo.u32 appropriate? Or do I misunderstand mul.lo.u32?

# parm_output[threadIdx.x + blockIdx.x * blockDim.x] = intval;
ld.param.u32   $r11, %parm_output;   #  id:30 %parm_output+0x0
cvt.u32.u16    $r12, %tid.x;         #
cvt.u32.u16    $r13, %ctaid.x;       #
cvt.u32.u16    $r14, %ntid.x;        #
mul.lo.u32     $r15, $r13, $r14;     #  blockIdx.x * blockDim.x
add.u32        $r16, $r12, $r15;     #  threadIdx.x + (blockIdx.x * blockDim.x)
mul.lo.u32     $r17, $r16, 4;        #  ... * sizeof(int)  ???
add.u32        $r18, $r11, $r17;     #
st.global.s32  [$r18+0], $r10;       #  id:34
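
For anyone following the listing, here is a rough C model of the same arithmetic, with one local variable per PTX register. It is only a sketch: it assumes mul.lo.u32 keeps the low 32 bits of the full product, which is how unsigned 32-bit multiplication behaves in C. The function name store_address is made up for illustration and does not come from the compiler output.

```c
#include <stdint.h>

/* Hypothetical C model of the PTX listing; each local mirrors one register. */
static uint32_t store_address(uint32_t parm_output, /* $r11: base address of the array */
                              uint16_t tid_x,       /* %tid.x   (threadIdx.x) */
                              uint16_t ctaid_x,     /* %ctaid.x (blockIdx.x)  */
                              uint16_t ntid_x)      /* %ntid.x  (blockDim.x)  */
{
    uint32_t r11 = parm_output;
    uint32_t r12 = tid_x;      /* cvt.u32.u16: zero-extend 16 -> 32 bits */
    uint32_t r13 = ctaid_x;
    uint32_t r14 = ntid_x;
    uint32_t r15 = r13 * r14;  /* mul.lo.u32 (assumed): product modulo 2^32 */
    uint32_t r16 = r12 + r15;  /* add.u32: element index */
    uint32_t r17 = r16 * 4u;   /* mul.lo.u32 (assumed): index * sizeof(int) */
    uint32_t r18 = r11 + r17;  /* add.u32: byte address passed to st.global */
    return r18;
}
```

Under that assumption, blockIdx.x = 65535, blockDim.x = 512 and threadIdx.x = 511 give an element index of 33554431, well above 2^16, and a byte offset of 134217724, which still fits in 32 bits.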

prkipfer · March 29, 2007, 10:29am

I don't think the .lo. indicates the number of bits used. For example, the __mul24 function is translated into the mul24.lo.s32 opcode.
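
To make the naming concrete: as far as I understand it, .lo and .hi select which half of the double-width product is kept, while the type suffix (.u32) sets the operand width. Here is a minimal C sketch of that reading. The helper names mul_lo_u32 and mul_hi_u32 are hypothetical, chosen only to mirror the opcodes.

```c
#include <stdint.h>

/* Hypothetical model of mul.lo.u32: low 32 bits of the 64-bit product. */
static uint32_t mul_lo_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Hypothetical model of mul.hi.u32: high 32 bits of the 64-bit product. */
static uint32_t mul_hi_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```

With this reading, mul_lo_u32(100000, 4) is 400000, so an index larger than 2^16 is scaled correctly. The high half only becomes nonzero when the product overflows 32 bits: for 0x80000000 * 4, the low half is 0 and the high half is 2.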

Peter

jhanweck · March 29, 2007, 1:56pm

Hence, the need for a cheatsheet… :P

Zvi · March 29, 2009, 10:25am

See PTX: Parallel Thread Execution, ISA Version 1.1 and PTX: Parallel Thread Execution, ISA Version 1.2.
