I read in the PTX document from NVIDIA that “PTX is a low-level parallel thread execution virtual machine (VM) and
virtual instruction set architecture (ISA)”. Can you guys elaborate on this concept in a little more detail?


Which part?

A virtual instruction set architecture is simply an instruction set that can be compiled to, but requires additional transformations before it can be executed by hardware. The idea is that you can do most of the compiler optimizations on the virtual ISA and then target many different processors by doing simple instruction-to-instruction translations.

Probably the most widespread use of this design philosophy is Java: bytecode is a virtual instruction set assumed to run on a virtual stack-based machine.

The line between virtual instruction sets and traditional ISAs is actually very blurry, as many designs effectively treat x86 as a virtual ISA because it is so widespread. In the mid-90s Intel added a hardware decoder that translates x86 into RISC-like instructions. The processor fetches x86 instructions, the hardware translates them into RISC instructions, and the processor executes the RISC instructions rather than x86. In this case, you can think of x86 as a virtual instruction set because it is not executed directly by the processor. This approach was later extended in the mid-to-late 90s by Transmeta and DEC, who did the decoding in software rather than hardware. Transmeta built a VLIW processor and DEC built a RISC processor; both used just-in-time compilers to translate programs from x86 to their native ISA immediately before execution.

Around 2003-2004 the LLVM compiler project created a virtual ISA that differed from most other virtual ISAs by adopting the philosophy that you could retain a significant amount of information from the high-level language in the instruction set, making it easier to do compiler optimizations on an already-compiled binary. Rather than treating registers as bags of bits, LLVM explicitly types all registers and instructions (int, char, float, etc.), keeps registers in SSA form, builds memory allocation and exception handling semantics into the language, and adds support for complex data structures (structs/unions) in the ISA. The point is that translating from this virtual ISA is just as easy as translating from Java bytecode or x86, but if you want to re-optimize the code, it is dramatically easier.
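As a rough illustration, here is a small C function with hand-written, simplified LLVM IR in the comment (not the exact output of any particular clang version, just the general shape):

```c
// A trivial function: scale a struct field and add a constant.
struct Point { float x; float y; };

float scale_x(struct Point *p, float s) {
    return p->x * s + 1.0f;
}

/* Hand-written, simplified LLVM IR for the function above.
   Every register (%p, %s, %x, ...) carries a type, each is assigned
   exactly once (SSA form), and the struct layout stays visible to
   the optimizer through getelementptr:

   %struct.Point = type { float, float }

   define float @scale_x(%struct.Point* %p, float %s) {
     %xptr = getelementptr %struct.Point, %struct.Point* %p, i32 0, i32 0
     %x    = load float, float* %xptr
     %mul  = fmul float %x, %s
     %add  = fadd float %mul, 1.0
     ret float %add
   }
*/
```

Because the IR keeps types and structure around, a later optimization pass can still reason about what the original program meant, which is much harder with raw machine code.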

I view PTX as an extension of LLVM with explicit support for parallelism; if you look at the actual instructions, they are very similar. PTX drops support for memory allocation and exception handling in the ISA because they do not apply to GPUs at this point. LLVM assumes a single thread of execution, whereas PTX assumes the CUDA single-program, multiple-thread model: many threads execute the same instructions and take different control flow paths based on input data.
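A minimal sketch of that model (a hypothetical kernel, with a hand-simplified PTX fragment in the comment rather than exact compiler output): every thread runs the same code, but the branch each thread takes depends on its data.

```cuda
#include <cuda_runtime.h>

// Every thread executes this same kernel; threadIdx/blockIdx give each
// thread a different index, and the branch outcome depends on the data.
__global__ void clamp_negatives(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] < 0.0f)   // threads diverge here based on input
            data[i] = 0.0f;
    }
}

/* Simplified sketch of the corresponding PTX. Like LLVM IR, the
   registers (%f1, %p1, %rd1, ...) are typed and virtual; the driver
   assigns physical registers later:

       setp.lt.f32   %p1, %f1, 0f00000000;   // data[i] < 0.0f ?
       @!%p1 bra     DONE;                   // predicated branch
       mov.f32       %f2, 0f00000000;
       st.global.f32 [%rd1], %f2;            // data[i] = 0.0f
   DONE:
       ret;
*/
```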

I think that NVIDIA uses PTX because they want to be able to change the architecture of their GPUs and still support legacy applications. To accomplish this, every CUDA program is first compiled into PTX and then translated to the native ISA of a particular GPU by the driver.
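That two-step pipeline is visible in the CUDA driver API: you can hand the driver a PTX string at run time and it will JIT-compile it for whatever GPU is present. A minimal sketch (assumes a CUDA-capable machine; the PTX source and the kernel name "kernel" are placeholders):

```cuda
#include <cuda.h>

// ptx_source would normally come from `nvcc -ptx kernel.cu`, embedded
// in the binary or loaded from a file; a real PTX module goes here.
extern const char *ptx_source;

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // The driver translates the PTX to the native ISA of `dev` here,
    // which is why the same PTX keeps working on newer GPU architectures.
    cuModuleLoadData(&mod, ptx_source);
    cuModuleGetFunction(&fn, mod, "kernel");  // "kernel" is a placeholder name

    // ... set up arguments and launch with cuLaunchKernel(fn, ...) ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

In practice nvcc can also embed precompiled native code for known architectures alongside the PTX, but the PTX fallback is what keeps legacy applications running on GPUs that did not exist when they were built.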