PTX assembly language reference does one exist, or plans to release one?

Are there plans to release a PTX assembly language manual? Or, does one exist somewhere?

It would be helpful for general debugging… Also, I can see writing assembly to hand optimize register use, and so forth.

That said, would it be possible to allow “inlined” assembly in Cuda code? As device functions, perhaps? (I seem to recall some old PC C compilers – maybe Turbo C? – allowed something like this.)

I believe the .ptx file is pre-optimized code, not sure that hand optimizing the register usage there would have an impact.
(e.g., I wrote some code that used 100 registers, but the final .cubin file says it only used 21)

Stewie is correct. PTX is a virtual machine assembly language, so it does not optimize for exact hardware instruction set or register usage. This way we can target different GPUs with the same assembly-level interface. The code is then further optimized at load time by the GPU driver.


Interesting… good to know.

Still, a PTX cheat sheet would be nice to have (e.g., for debugging).

For example: does mul.lo.u32 multiply (unsigned) the low 16 bits of the operands together into an 32-bit word?

In this snippet, which is generating an index into global memory array of ints, parm_output. The index is threadIdx.x + blockIdx.x * blockDim.x.

$r15 gets the product blockIdx.x * blockDim.x using mul.lo.u32, which makes sense because the block dimensions are limited to 65535.

$r16 gets the final index, adding threadIdx.x to the product above. This value could very well be larger than 2^16.

$r17 gets the offset into global memory, again using mul.lo.u32 to multiply the index ($r16) by sizeof(int) (4).

But since $r16 could be greater than 2^16, is mul.lo.u32 appropriate? Or do I misunderstand mul.lo.u32?

# parm_output[threadIdx.x + blockIdx.x * blockDim.x] = intval;

ld.param.u32  $r11, %parm_output;	#  id:30 %parm_output+0x0

cvt.u32.u16  $r12, %tid.x;    	#  

cvt.u32.u16  $r13, %ctaid.x;  	#  

cvt.u32.u16  $r14, %ntid.x;    #  

mul.lo.u32  $r15, $r13, $r14;  #  blockIdx.x * blockDim.x

add.u32  $r16, $r12, $r15;    	#  threadIdx.x + (blockIdx.x * blockDim.x)

mul.lo.u32  $r17, $r16, 4;    	#  ... * sizeof(int)  ???

add.u32  $r18, $r11, $r17;    	#  [$r18+0], $r10;	#  id:34

I think the .lo. is not an indicator for the number of bits used. For example the __mul24 function is translated into mul24.lo.s32 opcode.


Hence, the need for a cheatsheet… :P

See PTX: Parallel Thread Execution, ISA Version 1.1

and PTX: Parallel Thread Execution, ISA Version 1.2.