Going to learn PTX and write a GPU compiler

Hi again :)

This isn’t really a new topic, but learning PTX and writing a GPU compiler is a long way from “CUDA for Delphi?”, so I think it’s better to continue the discussion here :)

old topic: http://forums.nvidia.com/index.php?showtop…20&start=20

last two posts:

E.D. Riedijk:


Explain once again what you are trying to achieve. What does “to make a PTX code run from an EXE” mean, and how does it relate to a GPU compiler?
Are you generating PTX code on the fly? What is the problem with calling ptxas then?

Ok, I’ll try to explain step-by-step.

The final goal is to write a standalone GPU compiler that supports Pascal, C and PTX mixed on the fly, like this:


int max(a,b)



.reg .s32 res;

max.s32 res, a, b; 


return res; //variables declared in any language are visible in others.



function  gpu_sqrt(a:single):single;




but for this I need to:

  1. learn to code in PTX as well as possible

  2. learn how C and Pascal compilers work

  3. learn the theory of how a compiler is written

  4. start developing a mixed-syntax compiler (the output will be PTX code, of course) // this will take a veeeery long time, maybe years

  5. create a basic IDE

At this time I cannot even start the first task, because I don’t know how to make that virtual machine take my code and execute it. Is the “binary_map” generated by bin2c what I need? I don’t think so; it just takes a file and converts each byte to an ASCII hex code.
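To make the bin2c remark concrete, here is a minimal sketch of that kind of byte-to-hex conversion in C. It is not the real bin2c and its output layout is only an assumption; the function name bytes_to_c_array and the array name passed in are made up for illustration. The point is that the result is just data embedded as source text, and nothing in it executes anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the real bin2c): writes `len` bytes from `data`
 * into `out` as the text of a C array called `name`.
 * Returns the number of characters written, or -1 if `out` is too small. */
int bytes_to_c_array(const unsigned char *data, size_t len,
                     const char *name, char *out, size_t cap)
{
    assert(data != NULL && name != NULL && out != NULL);
    assert(strlen(name) > 0);

    int n = snprintf(out, cap, "static const unsigned char %s[] = {", name);
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < len; i++) {
        /* Start a new text line every 8 bytes to keep the output readable. */
        n = snprintf(out + used, cap - used, "%s0x%02x,",
                     (i % 8 == 0) ? "\n    " : " ", data[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, cap - used, "\n};\n");
    if (n < 0 || (size_t)n >= cap - used) return -1;
    return (int)(used + (size_t)n);
}
```

The generated text can be compiled into a host program, which still has to hand those bytes to something that knows how to load and launch them.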

From what I understand, PTX is still an intermediate language which is further optimized and then converted to machine (GPU native) code. A good starting goal might be to just get your compiler working where it can inline PTX instructions with the C/Pascal/etc. code and convert the whole thing to a PTX file that could be fed to the nvidia assembler or cudasm. Or, with CUDA 2.1, I believe there is a driver call to just execute PTX (it handles the conversion to native code for you), so you could write a little side program to test your PTX files by invoking that routine.

Since I’ve put a bit of thought into doing this very same project myself, I’ll say:

  • Read the PTX documentation that comes with the CUDA toolkit very carefully
  • Download cudasm and decuda (these will help you compile PTX and disassemble cubin files to PTX to examine the code)
  • Get yourself a good compilers book or two and study them to figure out what you need to do

The rest (writing the actual compiler, creating the IDE, etc.) will follow from there.

Hmm, as I understood it, I should call some driver functions to make a PTX file work, instead of compiling via ptxas.exe, which generates a useless .bin file with contents like this:

architecture {sm_10}
abiversion   {1}
modname	  {cubin}
code {
	name = __dummy_entry__
	lmem = 0
	smem = 0
	reg  = 0
	bar  = 0
	bincode {
		0xf0000001 0xe0000001 
	}
}

As for the PTX documentation, I’ve read it of course :D, but I’ll read it again :)

What options are you using to compile the file? It should be generating a binary .cubin file, not a text file like you posted.

EDIT: You can also check the CUDA 2.1 programming guide to find the call for the driver API that lets you execute a PTX string.


Usage  : ptxas [options] <ptx file>,...



--debug-info <String>			   (-debug-info)

		Specify name of file into which the DWARF information held by the parsed PTX files must be written.

--dont-merge-basicblocks			(-no-bb-merge)

		Normally, ptxas attempts to merge consecutive basic blocks as part of its optimization process. However, for debuggable code this is very confusing. This option prevents basic block merging, at a slight performance cost.

--entry <entry function>,...		(-e)

		Entry function name.

--force-externals				   (-fext)

		Used in debug compilation flow: generate device shadow variables in host address space as externals, as opposed to statics.

--gpu-name <gpu name>			   (-arch)

		Specify name of nVidia gpu to generate code for. This option also takes virtual compute architectures, in which case code generation is suppressed. This can be used for parsing only, or in combination with option --debug-info for extracting dwarf information (this mode is used by nvcc). Allowed values for this option:  'compute_10','compute_11','compute_12','compute_13','sm_10','sm_11','sm_12','sm_13'. Default value:  'sm_10'.

--help  (-h)

		Print this help information on this tool.

--input-as-string <ptx string>,...  (-ias)

		This option allows ptx modules to be passed directly as strings instead of via files. It can be used for simple runtime support, or when it is somehow not desired to pass the ptx string via the file system.

--key <string>					  (-k)

		Hash value representing the device code from which the binaries were compiled.

		Default value:  'key'.

--link-info <String>				(-link-info)

		Specify name of file into which the names of the symbols that are imported or exported by the compiled ptx module must be written.

--machine <bits>					(-m)

		Specify 32-bit vs. 64-bit architecture.

		Allowed values for this option:  32,64.

		Default value:  64.

--maxrregcount <N>				  (-maxrregcount)

		Specify the maximum amount of registers that GPU functions can use. Until a function- specific limit, a higher value will generally increase the performance of individual GPU threads that execute this function. However, because thread registers are allocated from a global register pool on each GPU, a higher value of this option will also reduce the maximum thread block size, thereby reducing the amount of thread parallelism. Hence, a good maxrregcount value is the result of a trade-off.

		If this option is not specified, then no maximum is assumed. Otherwise the specified value will be rounded to the next multiple of 4 registers until the GPU specific maximum of 128 registers.

--opt-level <N>					 (-O)

		Specify optimization level.

		Default value:  4.

--options-file <file>,...		   (-optf)

		Include command line options from specified file.

--output-file <file>				(-o)

		Specify name of output file.

		Default value:  'cubin.bin'.

--return-at-end					 (-ret-end)

		Normally, ptxas optimizes return instructions at the end of the program. However, for debuggable code this causes problems setting breakpoint at the end. This option prevents OCG from optimizing this last return instruction.

--trace-memory					  (-mem)

		Trace memory usage.

--translation-map <String>		  (-tmap)

		Specify name of file into which the ptx to assembly address translation map must be written.

--verbose						   (-v)

		Enable verbose mode.

--version						   (-V)

		Print version information on this tool.
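As a rough illustration of how the options above combine, the sketch below builds a ptxas command line in C. It uses only --gpu-name and --output-file from the help text; the function name build_ptxas_command and the file names calc.ptx and calc.cubin are made up for the example, and paths containing spaces are not quoted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a ptxas invocation into `out`.
 * Only --gpu-name and --output-file from the help text are used.
 * Returns the command length, or -1 if an argument is empty
 * or `out` is too small. */
int build_ptxas_command(char *out, size_t cap, const char *gpu_name,
                        const char *ptx_file, const char *cubin_file)
{
    assert(out != NULL && gpu_name != NULL);
    assert(ptx_file != NULL && cubin_file != NULL);

    if (strlen(gpu_name) == 0 || strlen(ptx_file) == 0 ||
        strlen(cubin_file) == 0)
        return -1;

    int n = snprintf(out, cap, "ptxas --gpu-name %s --output-file %s %s",
                     gpu_name, cubin_file, ptx_file);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Called with "sm_13", "calc.ptx" and "calc.cubin", this produces `ptxas --gpu-name sm_13 --output-file calc.cubin calc.ptx`, which a host program could pass to system() or an equivalent process-launching call.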

I saw those options :)

Situation.Update -> It seems the byte code is written in bincode {…}, so compiling is OK. Now I have to figure out how to start that virtual machine and configure it to run my code; it seems shaders and PTX work the same way ;)

.cubin is what actually executes on device, so I personally would not call it ‘useless’. Especially if you’re talking about compilers.

I know I was wrong: the data is contained in bincode inside the cubin file. But is it specific to my GTX260? Will the same file give me a different cubin.bin if I put in a 9600, 8800 or any other card instead of mine? If so, then how do a 44KB EXE and cubin.dll work on any CUDA-supported GPU?

loooool me, only now I found NVAPI in the developer zone :D Can anyone tell me which functions do that? Or do I have to work it out myself? :)

You can find this in the nvcc documentation pdf. Look up “Device code repositories”

Hmmm, interesting, but it seems it only allows a CUDA-based application developed in Visual Studio to use these features :( I found no information on how to make a PTX file run and call its functions from any other language :unsure:

Now what? Should I disassemble a CUDA-based app and spend hours in front of IDA? :argh: This is nonsense!!!

Every post you make confuses me more and more :)
NVAPI is not related to CUDA, AFAIK.

The contents of a .cubin file are determined not by the card you have installed in your system but by the set of parameters you pass to ptxas. Yes, code compiled for the 8-, 9- and 200-series will be different. However, it seems to be backward-compatible, i.e. code for the 8-series runs on the 9- and 200-series.
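The compatibility claim can be written down as a small check. This is only a sketch under an assumption: that a cubin built for sm_XY runs on a device sm_XZ when the major digit X matches and Z is at least Y, which is how I understand NVIDIA's binary-compatibility rule. The function names are made up, and only two-digit names like the ones in the ptxas help are parsed.

```c
#include <assert.h>
#include <string.h>

/* Parses a name of the form "sm_XY" (for example "sm_13") into X and Y.
 * Returns 1 on success, 0 if the string does not have that shape. */
static int parse_sm(const char *name, int *major, int *minor)
{
    assert(name != NULL && major != NULL && minor != NULL);
    if (strncmp(name, "sm_", 3) != 0 || strlen(name) != 5) return 0;
    if (name[3] < '0' || name[3] > '9') return 0;
    if (name[4] < '0' || name[4] > '9') return 0;
    *major = name[3] - '0';
    *minor = name[4] - '0';
    return 1;
}

/* Returns 1 if a cubin built for `code_arch` is expected to run on a
 * device of `device_arch` under the assumed rule (same major version,
 * device minor >= code minor), otherwise 0. */
int cubin_runs_on(const char *code_arch, const char *device_arch)
{
    int cmaj, cmin, dmaj, dmin;
    if (!parse_sm(code_arch, &cmaj, &cmin)) return 0;
    if (!parse_sm(device_arch, &dmaj, &dmin)) return 0;
    return cmaj == dmaj && dmin >= cmin;
}
```

Under that rule, code built with the default sm_10 target would be accepted by an sm_13 device, while code built for sm_13 would be rejected by an sm_10 device.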

What is cubin.dll?

You can’t call code contained in ptx from other languages.

What are you going to disassemble? I strongly encourage you to explain your questions/problem in more detail, otherwise your chances of getting an answer will be low.

I know I cannot call a PTX function like an ordinary function, e.g.


CALL gpu_sqrt

But there should be a way to make this virtual machine work, feed it CUBIN code and invoke an .entry. But how? Maybe the answer lies in every CUDA-based application? I’m going to make a simple CUDA project with VS2005, add every bit of debug info possible and follow step by step which methods are called while working with the GPU.

But every “post” of your president confuses me more and more :P

there is no virtual machine.

Why don’t you just read the driver API examples to see how to invoke kernels? (They are in .cubin files or code repositories, and as far as I understand there will be on-the-fly recompilation if needed when PTX code is included.)

Then you can just call the functions used in the examples from Pascal, or am I making it sound too easy?

Listen man, I don’t know every code sample or doc ever written by NV devs. If you know one, please be kind and post it. I’m really not someone who reads gigabytes of PDFs and millions of source code lines before asking questions, so it would be very nice to know where to find this (it seems you know it well).

One question: you mention the “Driver API” many times, but I couldn’t find a whole set of headers. The only thing I found was NVAPI, which really is not what I need (I had enough headaches with I2C and logic analyzers this summer, and NVAPI reminds me of that :unsure: :wacko: :blink: )

Finally, do you know a practical way to invoke, for example, this .entry from C or Pascal?

.version 1.3
.target sm_13

.func (.reg .f32 j) kukuza (.reg .f32 m, .reg .f32 n)
{
    mul.f32 j, m, n;
}

.entry gpu_sqrt
{
    .param .f32 a;
    .reg .f32 res, tmp, v;

    ld.param.f32 tmp, [a];
    sqrt.f32 res, tmp;
    call (v), kukuza, (res, tmp);
}

(couldn’t write code more stupid than this :P )

************* [update after 15 seconds :P]

W-w-w-w-w-ill you say it’s all about the CUDA driver API and I’m such a fool for searching through the whole display driver API!!! :unsure: :blink: :unsure:


We’ve found the code sample E.D. Riedijk was talking about (today, on the university’s ancient computer :D ), so the prerequisites for the first task are done. Now we’re writing an IDE to edit, compile and test PTX source files on the fly!!!

NOTE: If anyone wants to join our team and help make this crazy idea come true, you are always WELCOME! (We have a third member from http://www.forum.ge at this time :) )

People! We need help. Does anyone know how to compile a .cu file to PTX? Specifying nvcc.exe -ptx something.cu doesn’t work :(

nvcc.pdf definitely knows how to do it.

nvcc.pdf ?

We’ve figured out how to do that ourselves and calculated our first SQRT on my GTX260. The “sample” code was:

.version 1.3

	.target sm_10, map_f64_to_f32

	// compiled with ./../open64/lib//be.exe

	// nvopencc built on 2008-11-07

	.reg .u32 %ra<17>;

	.reg .u64 %rda<17>;

	.reg .f32 %fa<17>;

	.reg .f64 %fda<17>;

	.reg .u32 %rv<5>;

	.reg .u64 %rdv<5>;

	.reg .f32 %fv<5>;

	.reg .f64 %fdv<5>;


	// Compiling C:\Users\Delphi\AppData\Local\Temp/tmpxft_00004d98_00000000-9_simple.cpp3.i (C:/Users/Delphi/AppData/Local/Temp/ccBI#.a19984)



	// Options:


	//  Target:ptx, ISA:sm_10, Endian:little, Pointer Size:32

	//  -O3	(Optimization level)

	//  -g0	(Debug level)

	//  -m2	(Report advisories)


	.file	1	"C:\Users\Delphi\AppData\Local\Temp/tmpxft_00004d98_00000000-8_simple.cudafe2.gpu"

	.file	2	"F:\Programs\visual studio 2005\VC\INCLUDE\crtdefs.h"

	.file	3	"device_runtime.h"

	.file	4	"include\host_defines.h"

	.file	5	"include\builtin_types.h"

	.file	6	"h:\program files\cuda toolkit\include\device_types.h"

	.file	7	"h:\program files\cuda toolkit\include\driver_types.h"

	.file	8	"h:\program files\cuda toolkit\include\texture_types.h"

	.file	9	"h:\program files\cuda toolkit\include\vector_types.h"

	.file	10	"h:\program files\cuda toolkit\include\host_defines.h"

	.file	11	"include\device_launch_parameters.h"

	.file	12	"h:\program files\cuda toolkit\include\crt\storage_class.h"

	.file	13	"F:\Programs\visual studio 2005\VC\INCLUDE\time.h"

	.file	14	"simple.cu"

	.file	15	"include\common_functions.h"

	.file	16	"h:\program files\cuda toolkit\include\crt/func_macro.h"

	.file	17	"h:\program files\cuda toolkit\include\math_functions.h"

	.file	18	"h:\program files\cuda toolkit\include\device_functions.h"

	.file	19	"h:\program files\cuda toolkit\include\math_constants.h"

	.file	20	"h:\program files\cuda toolkit\include\sm_11_atomic_functions.h"

	.file	21	"h:\program files\cuda toolkit\include\sm_12_atomic_functions.h"

	.file	22	"h:\program files\cuda toolkit\include\sm_13_double_functions.h"

	.file	23	"h:\program files\cuda toolkit\include\texture_fetch_functions.h"

	.file	24	"h:\program files\cuda toolkit\include\math_functions_dbl_ptx1.h"

	.entry _Z9calculateffPf

	{

	.reg .u32 %r<3>;

	.reg .f32 %f<5>;

	.param .f32 __cudaparm__Z9calculateffPf_a;

	.param .f32 __cudaparm__Z9calculateffPf_b;

	.param .u32 __cudaparm__Z9calculateffPf_result;

	.loc	14	1	0


	.loc	14	3	0

	ld.param.f32 	%f1, [__cudaparm__Z9calculateffPf_a];	// id:7 __cudaparm__Z9calculateffPf_a+0x0

	ld.param.f32 	%f2, [__cudaparm__Z9calculateffPf_b];	// id:8 __cudaparm__Z9calculateffPf_b+0x0

	add.f32 	%f3, %f1, %f2;	   	// 

	ld.param.u32 	%r1, [__cudaparm__Z9calculateffPf_result];	// id:9 __cudaparm__Z9calculateffPf_result+0x0

	st.global.f32 	[%r1+0], %f3;	  // id:10

	.loc	14	4	0

	exit;						 	// 


	} // _Z9calculateffPf
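A side note on the entry name in the listing above: `_Z9calculateffPf` looks like the C++-mangled form of calculate(float, float, float*), i.e. `_Z`, then the name length 9, then the name, then the parameter codes f, f and Pf. Anything that looks the entry up by name would presumably need that full string. Below is a minimal C sketch that pulls the plain name back out of such a symbol; the function name mangled_base_name is made up, and it only handles the simple `_Z<length><name>...` shape, not nested or templated names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: extracts the plain function name from a simple
 * mangled symbol of the form _Z<length><name><parameter codes>.
 * Returns 1 and fills `out` on success, 0 if the symbol does not have
 * that shape or `out` is too small. */
int mangled_base_name(const char *sym, char *out, size_t cap)
{
    assert(sym != NULL && out != NULL);

    if (strncmp(sym, "_Z", 2) != 0) return 0;
    const char *p = sym + 2;
    if (!isdigit((unsigned char)*p)) return 0;   /* nested names not handled */

    /* Read the decimal length prefix. */
    size_t len = 0;
    while (isdigit((unsigned char)*p)) {
        len = len * 10 + (size_t)(*p - '0');
        p++;
    }

    if (len == 0 || strlen(p) < len || len + 1 > cap) return 0;
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```

For `_Z9calculateffPf` this yields `calculate`; a name that was never mangled, such as a hand-written `.entry calc`, is rejected by the prefix check.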

made some “undotfuscatoring” and got

.version 1.3
.target sm_13

.entry calc
{
    .reg .u32 A;
    .reg .f32 B;
    .param .f32 a;
    .param .f32 b;
    .param .u32 res;
    .reg .f32 f1,f2,f3;

    ld.param.f32 f1, [a];
    ld.param.f32 f2, [b];
    add.f32 f3, f1, f2;
    ld.param.u32 A, [res];
    st.global.f32 [A], f3;
}

Works fine, though it seems the transfer rate between the video card and RAM is very poor: 2.6GB/s.

Now we’re working on an automated code generator and an IDE ;)
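For anyone curious what the simplest possible code generator could look like, here is a sketch in C that emits PTX text shaped like the hand-written calc kernel in this post, with the arithmetic instruction swapped in. It is an illustration only: the function names and the small list of accepted operations are assumptions, the trailing exit mirrors the nvcc-generated listing, and the output has not been run through ptxas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Accepts only a few two-operand f32 instructions (an arbitrary list). */
static int is_supported_op(const char *op)
{
    static const char *ops[] = { "add", "sub", "mul", "min", "max" };
    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++)
        if (strcmp(op, ops[i]) == 0) return 1;
    return 0;
}

/* Hypothetical generator: writes a PTX module with one .entry named
 * `entry` that computes `a <op> b` and stores it through `res`.
 * Returns the text length, or -1 on a bad op or too small a buffer. */
int emit_binary_kernel(char *out, size_t cap, const char *entry, const char *op)
{
    assert(out != NULL && entry != NULL && op != NULL);
    if (strlen(entry) == 0 || !is_supported_op(op)) return -1;

    int n = snprintf(out, cap,
        ".version 1.3\n"
        ".target sm_13\n"
        "\n"
        ".entry %s\n"
        "{\n"
        "    .param .f32 a;\n"
        "    .param .f32 b;\n"
        "    .param .u32 res;\n"
        "    .reg .u32 A;\n"
        "    .reg .f32 f1, f2, f3;\n"
        "\n"
        "    ld.param.f32 f1, [a];\n"
        "    ld.param.f32 f2, [b];\n"
        "    %s.f32 f3, f1, f2;\n"
        "    ld.param.u32 A, [res];\n"
        "    st.global.f32 [A], f3;\n"
        "    exit;\n"
        "}\n",
        entry, op);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A real generator would build the text from a parsed syntax tree rather than from a fixed template, but a template like this is enough to feed test kernels to an IDE's compile-and-run button.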

P.S. There was no detailed info in the NVCC docs on how to compile CU to PTX.

nvcc -ptx?