Ocelot - Finding the PTX (Cat) inside the executable (Bag) Is Ocelot Dependent on the CUDA version?

Sarnath · October 5, 2010, 12:20pm

Hi There!

I just read an old paper from “Gregory Diamos” on Ocelot. The paper is titled “The design and implementation of Ocelot’s Dynamic Binary Translator from PTX to multi-core x86”.

Section III-A talks about how to extract PTX binary information. As I tried to verify this, I realized that the way the binary is registered and stored in CUDA 2.3 (that’s what I have on my Linux box) is quite different from what the paper claims.
For example, there is only a single constructor that registers all kernels (as opposed to a constructor per kernel), the extern variable “fatBinary” no longer exists, and so on.

So, I assume that NVIDIA has changed their internal binary representation and their APIs a bit in some CUDA version.

So, my questions to Greg (and to other knowledgeable ones) are:
0. Is my assumption right? I hope I am making sense here… If not, don’t read the next 2 questions.
1. Is Ocelot specific to any CUDA version? Or does it have compatibility issues across CUDA versions?
2. Is Ocelot updated every time NVIDIA decides to change their binary layout?

Thank you,
Best Regards,
Sarnath

Gregory_Diamos · October 5, 2010, 8:03pm

So when I wrote that part of the paper I was trying to write in generic terms rather than by referring to every variable by name. So by a fat binary, I meant a

typedef struct __cudaFatCudaBinaryRec {
    unsigned long                  magic;
    unsigned long                  version;
    unsigned long                  gpuInfoVersion;
    char*                          key;
    char*                          ident;
    char*                          usageMode;
    __cudaFatPtxEntry             *ptx;
    __cudaFatCubinEntry           *cubin;
    __cudaFatDebugEntry           *debug;
    void*                          debugInfo;
    unsigned int                   flags;
    __cudaFatSymbol               *exported;
    __cudaFatSymbol               *imported;
    struct __cudaFatCudaBinaryRec *dependends;
    unsigned int                   characteristic;
    __cudaFatElfEntry             *elf;
} __cudaFatCudaBinary;

This has actually remained more or less constant from one CUDA version to the next. Each fat binary is registered once via a global constructor, and then a series of __cudaRegisterFunction calls is made, one for each kernel, also by global constructors.
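The registration sequence described above can be modeled with a small stand-alone sketch. Everything in it is hypothetical: MockPtxEntry, MockFatBinary, Registry, registerFatBinary and registerFunction are illustration-only names, not the CUDA runtime's or Ocelot's API. The mock record keeps only two of the fields listed above, and the sketch assumes the PTX entry array ends with an entry whose pointers are null.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, trimmed-down stand-ins for the fat binary record above.
struct MockPtxEntry {
    const char* gpuProfileName;  // nullptr marks the end of the array (assumption)
    const char* ptx;             // PTX text for that profile
};

struct MockFatBinary {
    const char*         ident;  // name of the module
    const MockPtxEntry* ptx;    // array of PTX entries
};

// Hypothetical registry playing the role of the runtime that the
// global constructors call into.
class Registry {
public:
    // Called once per fat binary; returns a handle used by later calls.
    int registerFatBinary(const MockFatBinary* binary) {
        assert(binary != nullptr);
        binaries_.push_back(binary);
        return static_cast<int>(binaries_.size()) - 1;
    }

    // Called once per kernel; ties a kernel name to its fat binary.
    void registerFunction(int handle, const std::string& kernelName) {
        assert(handle >= 0 && handle < static_cast<int>(binaries_.size()));
        kernelToBinary_[kernelName] = handle;
    }

    // Returns the first PTX text of the binary that owns the kernel,
    // or an empty string if the kernel is unknown or has no PTX.
    std::string ptxFor(const std::string& kernelName) const {
        auto it = kernelToBinary_.find(kernelName);
        if (it == kernelToBinary_.end()) return "";
        const MockPtxEntry* entry = binaries_[it->second]->ptx;
        if (entry == nullptr || entry->gpuProfileName == nullptr ||
            entry->ptx == nullptr) {
            return "";
        }
        return entry->ptx;
    }

    std::size_t binaryCount() const { return binaries_.size(); }
    std::size_t kernelCount() const { return kernelToBinary_.size(); }

private:
    std::vector<const MockFatBinary*> binaries_;
    std::map<std::string, int>        kernelToBinary_;
};
```

In this model a lookup by kernel name yields the PTX text of the whole module, not of the single kernel, so isolating one kernel would still require parsing that text.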

NVIDIA adds and removes API functions on pretty much every CUDA version, so even though the binary format is fairly standard, a version of Ocelot is typically tied to a version of CUDA.

Ocelot would be updated if the binary format ever changed significantly enough to cause a bug.

Sarnath · October 6, 2010, 5:12am

Wow! Thanks for answering! It’s the same as what I am seeing here… but how could one get the “PTX” corresponding to each kernel name from this info?

This was the most difficult part for me… I see an array of binary numbers out there… but which one corresponds to which kernel? How did you figure that out? By the way, great work!

Thanks in advance!

Sarnath · October 6, 2010, 6:14am

I found the answer by going through the Ocelot source… extractPTXKernels().

So, it is all done by PTX parsing… My god!!

Is the whole PTX parsing documented anywhere? (I know only the PTX ISA manual.) Or did you reverse engineer this?

coutinho · October 6, 2010, 9:26am

An easy way to see how Ocelot parses PTX is to compile a program with --extern=all and pass a PTX file to PTXOptimizer or CFG.

PTX parsing is mostly:

1st phase: read a PTX file and generate an array of PTXStatements. I don’t know much about this phase, sorry :(

2nd phase: convert the array of PTXStatements into several PTXKernels, each with a CFG.

extractPTXKernels(): finds the kernel limits in the code and calls the PTXKernel() constructors, passing iterators that mark those limits.
PTXKernel constructor: does some bookkeeping and calls constructCFG(), which does all the heavy lifting.
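The second phase can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Ocelot's code: Statement, Kernel and extractKernels are hypothetical names, the statement type is far simpler than a real PTXStatement, a kernel is assumed to run from its entry statement to the matching closing brace, and the CFG construction is left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified statement; a real PTXStatement carries much more.
struct Statement {
    enum Kind { Entry, OpenBrace, CloseBrace, Instruction, Other };
    Kind        kind;
    std::string text;  // kernel name for Entry, raw text otherwise
};

using StatementIterator = std::vector<Statement>::const_iterator;

// Hypothetical kernel built from an iterator range over the statements.
struct Kernel {
    std::string name;
    std::size_t instructionCount;

    Kernel(const std::string& kernelName, StatementIterator begin,
           StatementIterator end)
        : name(kernelName), instructionCount(0) {
        // Bookkeeping only; building a CFG is out of scope for this sketch.
        for (StatementIterator it = begin; it != end; ++it) {
            if (it->kind == Statement::Instruction) ++instructionCount;
        }
    }
};

// Finds each kernel's limits (Entry up to the matching CloseBrace) and
// constructs one Kernel per range.
std::vector<Kernel> extractKernels(const std::vector<Statement>& statements) {
    std::vector<Kernel> kernels;
    StatementIterator it = statements.begin();
    while (it != statements.end()) {
        if (it->kind != Statement::Entry) {
            ++it;
            continue;
        }
        StatementIterator begin = it;
        StatementIterator end = it;
        int depth = 0;
        bool opened = false;
        for (; end != statements.end(); ++end) {
            if (end->kind == Statement::OpenBrace) {
                ++depth;
                opened = true;
            }
            if (end->kind == Statement::CloseBrace && depth > 0) --depth;
            if (opened && depth == 0) {
                ++end;  // make the range half-open, past the closing brace
                break;
            }
        }
        kernels.emplace_back(begin->text, begin, end);
        it = end;
    }
    return kernels;
}
```

Passing iterator ranges instead of copying statements keeps each kernel as a view into the one statement array, which matches the description of handing "iterators with kernel limits" to the constructor.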

Sarnath · October 6, 2010, 9:36am

Thanks for your answer,

The PTX parsing phase is the most interesting one. Do you know where I can find documentation on it? Thanks!

Gregory_Diamos · October 6, 2010, 12:13pm

There isn’t a whole lot of documentation on the PTX parser. It is composed of a FLEX description of how to convert text into tokens and then a BISON parser that converts from tokens into PTXStatements. This parser makes callbacks into a PTXParser class that actually creates the statements when a pattern is matched. This is done in order to make the BISON source file more readable.

Lexer: http://code.google.com/p/gpuocelot/source/…ntation/ptx.lpp

Bison Grammar: http://code.google.com/p/gpuocelot/source/…/ptxgrammar.ypp

PTXParser: http://code.google.com/p/gpuocelot/source/…n/PTXParser.cpp

There is also a class that wraps the lexer (PTXLexer) to make it more C++ friendly.
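The lexer, grammar and callback split described above can be illustrated with a hand-written miniature. This is only a sketch under assumptions: tokenize stands in for the Flex lexer, parse stands in for the Bison grammar, and StatementBuilder stands in for the callback class. All three names, and the tiny grammar (an opcode, an optional comma-separated operand list, and a semicolon), are made up for illustration and cover only a sliver of real PTX.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical statement produced by the parser callbacks.
struct ParsedStatement {
    std::string              opcode;
    std::vector<std::string> operands;
};

// Receives callbacks from the grammar so that the grammar itself stays short.
class StatementBuilder {
public:
    void onOpcode(const std::string& opcode) {
        current_ = ParsedStatement();
        current_.opcode = opcode;
    }
    void onOperand(const std::string& operand) {
        current_.operands.push_back(operand);
    }
    void onEndOfStatement() { statements_.push_back(current_); }
    const std::vector<ParsedStatement>& statements() const { return statements_; }

private:
    ParsedStatement              current_;
    std::vector<ParsedStatement> statements_;
};

// Lexer stand-in: splits text into tokens, keeping ',' and ';' as
// single-character tokens and skipping whitespace.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        bool isSpace = std::isspace(static_cast<unsigned char>(c)) != 0;
        if (isSpace || c == ',' || c == ';') {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
            if (!isSpace) tokens.push_back(std::string(1, c));
        } else {
            current += c;
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

static bool isPunctuation(const std::string& token) {
    return token == "," || token == ";";
}

// Grammar stand-in: statement := opcode [operand (',' operand)*] ';'
// Returns false at the first token that does not fit that pattern.
bool parse(const std::vector<std::string>& tokens, StatementBuilder& builder) {
    std::size_t i = 0;
    while (i < tokens.size()) {
        if (isPunctuation(tokens[i])) return false;
        builder.onOpcode(tokens[i++]);
        if (i < tokens.size() && !isPunctuation(tokens[i])) {
            builder.onOperand(tokens[i++]);
            while (i < tokens.size() && tokens[i] == ",") {
                ++i;
                if (i >= tokens.size() || isPunctuation(tokens[i])) return false;
                builder.onOperand(tokens[i++]);
            }
        }
        if (i >= tokens.size() || tokens[i] != ";") return false;
        ++i;
        builder.onEndOfStatement();
    }
    return true;
}
```

The design point is the same one made in the post: the code that recognizes patterns stays small because the work of building statements lives in a separate class.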

Edit: As for determining how to write the parser, I just looked at the PTX ISA manual and started from there.

Ken_Domino · October 6, 2010, 6:34pm

If you’re interested in another version of a PTX grammar, I wrote one as well, in ANTLR. It is LL as opposed to LR. See the project page (Google Code Archive). This grammar generates an AST as well, but I haven’t completed the documentation of that. The “PTX: Parallel Thread Execution ISA Version 2.2” doc has no grammar, and the descriptions of the instructions are not that great (plus errors). So I also had to derive the grammar from compiler output and hand-written tests that pushed the syntax, then test them using ptxas and a CUDA driver program. ANTLRWorks can be used to view the parse and AST of a PTX source file. --Ken D.

Sarnath · October 7, 2010, 8:55am

Hi Greg,

So the PTX ISA manual was your start… Hmm… Very interesting!! And really awesome, gutsy work!
Hmm… The PTX binary is just an assembled version of the PTX… So… the manual must have been helpful!

By the way, did you ever use “decuda” to understand anything?
I think “decuda” comes after the PTX stage… so it must not have been a great help! Can you confirm?

Thanks for writing back,

Ken,
Thanks for sharing! It’s going to be useful! Many thanks!
By the way, I think that is applicable to the “PTX” assembly language in “text” format, right?
I may need to do some conversion before using it… I will check it out. Thanks!

Best regards,
Sarnath

Sarnath · October 7, 2010, 9:05am

Ken,

As I was reading through your “cuda-waste” project, I was wondering whether you ever tried compiling Ocelot under Cygwin. I would guess it must be a minor thing.

Anyway, good luck with your project!

And if I may take the liberty, may I ask you to consider a better name for your project (than cuda-waste)?

Thanks,
