PTX in binary ?

Hello,

As far as I know PTX files are always in text.

Is there a binary version of PTX (so still virtual, but also in binary) ?

Bye,
Skybuck.

AFAIK, machine code is binary and is stored in cubin files, which can be embedded in executables. I’ve never heard of any binary representation of PTX. Though, AMD has AMDIL and OpenCL IR, which are binary.

What is a cubin? Is it the text PTX instructions/assembly compiled to a virtual binary form? Or is it compiled further into the final “secret GPU micro-instructions”?

This is important to know, because NVIDIA claims that the PTX versions can be compiled for new/future hardware, while GPU-specific compiled versions might not be.

It is not clear that there is a virtual binary instruction set. PTX is an assembly language for a GPU-like virtual machine, but I don’t know if CUDA is like Java and has a compact machine code for the same virtual machine.
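As an analogy from another virtual machine (not CUDA itself): CPython, like Java, keeps the human-editable source as text but compiles it to a compact binary bytecode for its VM. A quick sketch showing that the VM instructions really are just a small blob of bytes, which is the kind of “virtual binary” format being discussed:

```python
# Analogy only: CPython compiles text source into compact VM bytecode,
# much like Java compiles .java text into binary .class bytecode.
src = "def add(a, b):\n    return a + b\n"

code_obj = compile(src, "<example>", "exec")

# The first constant of the module code object is the code object for add();
# its raw VM instructions are just bytes -- a "virtual binary" format.
func_code = code_obj.co_consts[0]
print(type(func_code.co_code))           # <class 'bytes'>
print(len(src), len(func_code.co_code))  # text size vs. bytecode size
```

Whether NVIDIA’s toolchain has an analogous compact encoding for PTX internally is exactly the open question here.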

If you read the CUDA Programming Guide, section 3.1, you will see that NVIDIA indicates that cubins are architecture-specific. It is only the PTX format that can be JIT compiled to newer architectures.

I find this PTX text format a bit risky… what if there are text-based buffer overruns in whatever compiles the PTX?!?

Perhaps PTX in “virtual binary” would offer some more protection.

There could also be other advantages:

  1. If the virtual instruction set were “fixed-size”, meaning all instructions have the same size, then perhaps interesting things could be done, like easily generating new PTX, or making PTX self-modifying and perhaps recompiling it.

  2. Offer more execution/startup speed. Instead of having to call/invoke a potentially slow “PTX” parser, it could instead use a fast “PTX” reader.

  3. Perhaps more protection against buffer overruns, so perhaps more security.

  4. Perhaps also more difficult to reverse engineer ?!? Though this would not be its main aim.

  5. Perhaps PTX virtual binary is more compact, saving some space/bytes.

  6. Less information leaked about the code behind the PTX. (There seems to be quite a lot of useless text in PTX ?!? like folder and file names… which are probably not used ??? or maybe only for debugging or analysis ?)

  7. Perhaps easier to develop tools to run through PTX, no parsers needed.

  8. Perhaps easier to emulate PTX on other machines, even virtual machines. Anybody could write a PTX parser, so just having PTX in text doesn’t stop people from making their own binary PTX format… NVIDIA might as well take the lead and do it before anybody else does, just to be able to control it somewhat and set the standard.
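To illustrate point 8, the barrier to writing at least a crude PTX text parser really is low. A toy sketch (nowhere near the full PTX grammar, just top-level directives) in Python:

```python
import re

def parse_ptx_directives(ptx_text):
    """Toy parser: collect top-level PTX directives (.version, .target,
    .entry, .file, ...) from PTX text. A real parser needs the full PTX
    grammar; this only shows how low the barrier is for a text format."""
    directives = []
    for line in ptx_text.splitlines():
        line = line.split("//", 1)[0].strip()   # drop // comments
        m = re.match(r"\.(\w+)\s*(.*)", line)   # ".name args"
        if m:
            directives.append((m.group(1), m.group(2).rstrip(";")))
    return directives

# A few lines taken from the empty-kernel dump further down the thread:
sample = """
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with nvopencc
.entry _Z6kernelii (
"""

for name, args in parse_ptx_directives(sample):
    print(name, args)
```

Anyone can get this far in an afternoon, which is the argument for NVIDIA defining an official binary encoding before third parties invent incompatible ones.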

There are of course some risks involved with a binary version: it is more difficult to keep flexible towards the future. Text is easily changed/manipulated and parsers can be changed; with binary this becomes a bit more difficult, but not impossible.

A simple start could be:
PTX_VERSION unsigned long
PTX_etc

This should offer quite some version possibilities! :)
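A minimal sketch of what such a fixed-size header could look like, using Python’s struct module. All field names and the layout here are made up purely for illustration; only the idea of a leading unsigned version field comes from the suggestion above:

```python
import struct

# Hypothetical "binary PTX" header: a magic number, an unsigned 32-bit
# version field (the PTX_VERSION suggested above), and an instruction
# count. Layout and names are invented for this sketch.
HEADER_FMT = "<4sII"   # little-endian: 4-byte magic, u32 version, u32 count

def pack_header(version, n_instructions):
    return struct.pack(HEADER_FMT, b"BPTX", version, n_instructions)

def unpack_header(blob):
    magic, version, n = struct.unpack_from(HEADER_FMT, blob)
    if magic != b"BPTX":
        raise ValueError("not a binary-PTX blob")
    return version, n

# e.g. PTX 1.4 encoded as major << 16 | minor, one instruction
hdr = pack_header((1 << 16) | 4, 1)
print(len(hdr))   # fixed 12-byte header
```

A fixed-size header like this is trivially versionable: a reader checks the magic and version before touching anything else, which is also where the buffer-overrun argument comes in.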

@Skybuck

I totally agree with you.

On another note, you can check out GPU Ocelot and GPU.NET. The guys behind those projects probably have what you are looking for.

I find this a very weak argument for a binary format. Some of your other arguments are more compelling, though.

Please, please no self-modifying code! :)

This does make some sense, since NVIDIA is now committed to including a full PTX compiler inside the graphics driver (not just in the CUDA toolkit). However, it may turn out that parsing the text-based PTX format does not account for a significant fraction of the compilation time. If that is true, then having both a text and a binary format is maintenance overhead without much win.

There are better ways to prevent buffer overruns in text processing than inventing a new data format.

This is a genuine concern that has been expressed by several CUDA developers. If you are worried about someone being able to reverse-engineer your CUDA code from the embedded text-based PTX, then currently you have to ship your code with cubins only. In that case, you have to make sure you compile for every possible architecture your users could have, and you give up forward compatibility with new architectures until you roll out a new build. Including PTX in your binary is the only forward-compatible option. A binary PTX format would increase obfuscation for those who want it, while still preserving forward compatibility.

I think this is only a win if you have a very, very large kernel.

Reducing the difficulty in writing a PTX parser is nice, although Ocelot has already done the hard work for everyone, as hqneuron mentioned. They have an open-source PTX parser in order to do PTX to PTX translation for various research purposes. It is a very neat project, and I would highly recommend you take a look.

Here is an adder kernel that was optimized away, so this is now an empty kernel; it’s already 4 KB!

.version 1.4
.target sm_10, map_f64_to_f32
// compiled with C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../open64/lib//be.exe
// nvopencc 4.0 built on 2011-05-13

//-----------------------------------------------------------
// Compiling C:/Users/Skybuck/AppData/Local/Temp/tmpxft_000017b0_00000000-11_adder.cpp3.i (C:/Users/Skybuck/AppData/Local/Temp/ccBI#.a04264)
//-----------------------------------------------------------

//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
//  Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
//  -O3	(Optimization level)
//  -g0	(Debug level)
//  -m2	(Report advisories)
//-----------------------------------------------------------

.file	1	"C:/Users/Skybuck/AppData/Local/Temp/tmpxft_000017b0_00000000-10_adder.cudafe2.gpu"
.file	2	"c:\tools\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h"
.file	3	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\crt/device_runtime.h"
.file	4	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\host_defines.h"
.file	5	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\builtin_types.h"
.file	6	"c:\tools\cuda\toolkit 4.0\v4.0\include\device_types.h"
.file	7	"c:\tools\cuda\toolkit 4.0\v4.0\include\driver_types.h"
.file	8	"c:\tools\cuda\toolkit 4.0\v4.0\include\surface_types.h"
.file	9	"c:\tools\cuda\toolkit 4.0\v4.0\include\texture_types.h"
.file	10	"c:\tools\cuda\toolkit 4.0\v4.0\include\vector_types.h"
.file	11	"c:\tools\cuda\toolkit 4.0\v4.0\include\builtin_types.h"
.file	12	"c:\tools\cuda\toolkit 4.0\v4.0\include\host_defines.h"
.file	13	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\device_launch_parameters.h"
.file	14	"c:\tools\cuda\toolkit 4.0\v4.0\include\crt\storage_class.h"
.file	15	"C:\Tools\Microsoft Visual Studio 10.0\VC\bin/../../VC/INCLUDE\time.h"
.file	16	"O:/CUDA C/test add-er/version 0.01/adder.cu"
.file	17	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\common_functions.h"
.file	18	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_functions.h"
.file	19	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_constants.h"
.file	20	"c:\tools\cuda\toolkit 4.0\v4.0\include\device_functions.h"
.file	21	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_11_atomic_functions.h"
.file	22	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_12_atomic_functions.h"
.file	23	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_13_double_functions.h"
.file	24	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_20_atomic_functions.h"
.file	25	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_20_intrinsics.h"
.file	26	"c:\tools\cuda\toolkit 4.0\v4.0\include\surface_functions.h"
.file	27	"c:\tools\cuda\toolkit 4.0\v4.0\include\texture_fetch_functions.h"
.file	28	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_functions_dbl_ptx1.h"


.entry _Z6kernelii (
	.param .s32 __cudaparm__Z6kernelii_a,
	.param .s32 __cudaparm__Z6kernelii_b)
{
.loc	16	1	0

$LDWbegin__Z6kernelii:
.loc 16 6 0
exit;
$LDWend__Z6kernelii:
} // _Z6kernelii

That’s a whole lot of wasted text/bytes for nothing ?!?

Ten of these tiny kernels would already require 40 kilobytes; that’s quite a lot for very little!

Is all this text really necessary ?!?
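A quick experiment to put a number on that: strip the comment lines and the .file/.loc debug directives from an empty-kernel dump like the one above and compare sizes. (The sample below is abbreviated from the dump; whether the driver’s JIT actually needs .loc/.file for debugging is a separate question.)

```python
# Rough sketch: how much of the empty-kernel PTX is comments and
# .file/.loc debug directives? Sample abbreviated from the dump above.
sample_ptx = """\
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with be.exe
// nvopencc 4.0 built on 2011-05-13
.file 1 "tmpxft_000017b0_00000000-10_adder.cudafe2.gpu"
.file 2 "sourceannotations.h"
.entry _Z6kernelii (
    .param .s32 __cudaparm__Z6kernelii_a,
    .param .s32 __cudaparm__Z6kernelii_b)
{
.loc 16 1 0
$LDWbegin__Z6kernelii:
exit;
$LDWend__Z6kernelii:
}
"""

def strip_waste(ptx):
    kept = []
    for line in ptx.splitlines():
        s = line.strip()
        if s.startswith("//") or s.startswith(".file") or s.startswith(".loc"):
            continue
        kept.append(line)
    return "\n".join(kept)

lean = strip_waste(sample_ptx)
print(len(sample_ptx), len(lean))  # the stripped version is noticeably smaller
```

Even in this small sample, a large share of the bytes are comments and debug paths; on the full 4 KB dump with its 28 .file lines the ratio would be much worse.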

I could be wrong, but I think entire C64 games were written in just 64 KB ?!?

If 200 kB of PTX is a serious burden for your application, then you probably are not running on a platform that supports CUDA in the first place. Once we are running CUDA on 16-bit microcontrollers, this will be a problem. :)

Think of “internet”.

I would like to make small applications so they can easily be downloaded…

300 KB to 1 MB is acceptable, but beyond that it gets a bit annoying, especially if it’s just waste like the .file sections. It does seem wasteful?