Ocelot - Finding the PTX (Cat) inside the executable (Bag): Is Ocelot dependent on the CUDA version?

Hi There!

I just read an old paper from “Gregory Diamos” on Ocelot. The paper is titled “The design and implementation of Ocelot’s Dynamic Binary Translator from PTX to multi-core x86”.

Section III A talks about how to extract PTX binary information. As I tried to verify the information, I realized that the way the binary is registered and stored in CUDA 2.3 (that's what I have on my Linux box) is quite different from what the paper claims.
For example, there is only a single constructor that registers all kernels (as opposed to a constructor per kernel), the extern variable “fatBinary” no longer exists, and so on.

So, I assume that NVIDIA has changed their internal binary representation and their APIs a bit in some CUDA version.

So, my questions to Greg (and to other knowledgeable ones) are:
0. Is my assumption right? I hope I am making sense here… If not, don't read the next 2 questions.

  1. Is Ocelot specific to any CUDA version? Or does it have compatibility issues with CUDA versions?
  2. Is Ocelot updated every time NVIDIA decides to change their binary layout?

Thank you,
Best Regards,
Sarnath

So when I wrote that part of the paper I was trying to write in generic terms rather than by referring to every variable by name. So by a fat binary, I meant a

typedef struct __cudaFatCudaBinaryRec {
    unsigned long                  magic;
    unsigned long                  version;
    unsigned long                  gpuInfoVersion;
    char*                          key;
    char*                          ident;
    char*                          usageMode;
    __cudaFatPtxEntry             *ptx;
    __cudaFatCubinEntry           *cubin;
    __cudaFatDebugEntry           *debug;
    void*                          debugInfo;
    unsigned int                   flags;
    __cudaFatSymbol               *exported;
    __cudaFatSymbol               *imported;
    struct __cudaFatCudaBinaryRec *dependends;
    unsigned int                   characteristic;
    __cudaFatElfEntry             *elf;
} __cudaFatCudaBinary;

This has actually remained more or less constant from one CUDA version to the next. Each fat binary is registered once via a global constructor, and then a series of __cudaRegisterFunction calls is made, one for each kernel, also by global constructors.
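
To make the registration flow concrete, here is a rough, self-contained sketch of how a replacement runtime could record what gets registered. Everything in it (MockFatBinary, mockRegisterFatBinary, mockRegisterFunction, Registrar) is a hypothetical stand-in made up for illustration; the real runtime entry points take more arguments, and the PTX string is only a placeholder.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily reduced stand-in for the fat binary record.
struct MockFatBinary {
    const char* ident;  // name of the translation unit
    const char* ptx;    // PTX text covering every kernel in the module
};

// What a replacement runtime might remember per registered module.
struct Module {
    std::string ptx;                   // PTX text of the whole module
    std::vector<std::string> kernels;  // kernel names announced later
};

// Function-local static avoids static initialization order problems,
// since registration happens before main() starts.
std::map<int, Module>& registry() {
    static std::map<int, Module> modules;
    return modules;
}

// Called once per fat binary: store the PTX and hand back a handle.
int mockRegisterFatBinary(const MockFatBinary& binary) {
    int handle = static_cast<int>(registry().size());
    registry()[handle].ptx = binary.ptx;
    return handle;
}

// Called once per kernel: attach the kernel name to its module.
void mockRegisterFunction(int handle, const char* kernelName) {
    assert(registry().count(handle) == 1 && "unknown fat binary handle");
    registry()[handle].kernels.push_back(kernelName);
}

// Stand-in for compiler-generated startup code: a global object whose
// constructor runs before main() and performs the registration.
struct Registrar {
    int handle;
    Registrar() {
        // Placeholder text, not real compiler output.
        static const MockFatBinary binary = {
            "example.cu", ".entry scale { }\n.entry add { }\n"};
        handle = mockRegisterFatBinary(binary);
        mockRegisterFunction(handle, "scale");
        mockRegisterFunction(handle, "add");
    }
};
static Registrar registrar;
```

Note that the module's PTX arrives as one blob while the kernel names arrive separately, so the registration calls alone do not say which part of the PTX belongs to which kernel.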

NVIDIA adds and removes API functions on pretty much every CUDA version, so even though the binary format is fairly standard, a version of Ocelot is typically tied to a version of CUDA.

As for your last question: Ocelot would be updated if the binary format ever changed significantly enough to cause a bug.

Wow! Thanks for answering! It's the same as what I am seeing here… but how could one get the “PTX” corresponding to each kernel name from this info?

This was the most difficult part for me… I see an array of binary numbers out there… but which one corresponds to which kernel? How did you figure that out? By the way, great work!

Thanks in advance!

I found the answer by going through the Ocelot source… extractPTXKernels().

So, it is all done by PTX parsing… My god!!

Is the whole PTX parsing documented anywhere? (I know only the PTX ISA manual.) Or did you reverse engineer it?

An easy way to see how Ocelot parses PTX is to compile a program with --extern=all and pass a PTX file to PTXOptimizer or CFG.

PTX parsing is mostly:

1st phase: read a PTX file and generate an array of PTXStatements. I don’t know much about this phase, sorry :(

2nd phase: convert the array of PTXStatements into several PTXKernels, each with a CFG.

  • extractPTXKernels(): finds the kernel boundaries in the code and calls the PTXKernel constructor, passing iterators that mark those boundaries
  • PTXKernel constructor: does some bookkeeping and calls constructCFG(), which does all the heavy lifting.
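
The "find the kernel boundaries" step can be illustrated with a rough sketch that works directly on PTX text. This is not Ocelot's implementation (which, as described above, operates on an array of PTXStatements); splitKernels is a hypothetical helper, and it assumes well-formed PTX with no braces inside comments.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Maps each kernel name to the text of its ".entry" definition.
std::map<std::string, std::string> splitKernels(const std::string& ptx) {
    std::map<std::string, std::string> kernels;
    const std::string keyword = ".entry";
    std::size_t pos = 0;
    while ((pos = ptx.find(keyword, pos)) != std::string::npos) {
        // The kernel name is the identifier that follows ".entry".
        std::size_t nameBegin =
            ptx.find_first_not_of(" \t\r\n", pos + keyword.size());
        if (nameBegin == std::string::npos) break;
        std::size_t nameEnd = ptx.find_first_of(" \t\r\n({", nameBegin);
        if (nameEnd == std::string::npos) break;

        // The body starts at the first '{' after the name; the parameter
        // list in between contains no braces.
        std::size_t open = ptx.find('{', nameEnd);
        if (open == std::string::npos) break;

        // Walk forward to the '}' that balances the opening brace.
        int depth = 0;
        std::size_t end = open;
        for (; end < ptx.size(); ++end) {
            if (ptx[end] == '{') ++depth;
            if (ptx[end] == '}' && --depth == 0) break;
        }
        if (end == ptx.size()) break;  // unbalanced braces: give up
        assert(depth == 0);

        kernels[ptx.substr(nameBegin, nameEnd - nameBegin)] =
            ptx.substr(pos, end - pos + 1);
        pos = end + 1;  // continue scanning after this kernel
    }
    return kernels;
}
```

Calling splitKernels on a module's PTX text gives one entry per kernel name, which is the name-to-PTX mapping asked about earlier in the thread. A real parser is more robust because it works on tokens rather than raw characters.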

Thanks for your answer,

The PTX parsing phase is the most interesting one. Do you know where I can find documentation on it? Thanks!

There isn’t a whole lot of documentation on the PTX parser. It is composed of a FLEX description of how to convert text into tokens and then a BISON parser that converts from tokens into PTXStatements. This parser makes callbacks into a PTXParser class that actually creates the statements when a pattern is matched. This is done in order to make the BISON source file more readable.

Lexer: http://code.google.com/p/gpuocelot/source/…ntation/ptx.lpp

Bison Grammar: http://code.google.com/p/gpuocelot/source/…/ptxgrammar.ypp

PTXParser: http://code.google.com/p/gpuocelot/source/…n/PTXParser.cpp

There is also a class that wraps the lexer (PTXLexer) to make it more C++ friendly.

Edit: As for determining how to write the parser, I just looked at the PTX ISA manual and started from there.
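
To illustrate the lexer, grammar, and callback split described above, here is a tiny hand-written sketch. It is only a stand-in for the Flex and Bison generated code: Statement, StatementBuilder, tokenize, and parse are hypothetical names, and the single toy rule handled here (an opcode followed by comma-separated operands and a semicolon) covers only a small corner of PTX.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical statement record; a real PTX statement holds far more.
struct Statement {
    std::string opcode;
    std::vector<std::string> operands;
};

// Plays the role described for the PTXParser class: the grammar reports
// what it matched, and this object builds the statements.
class StatementBuilder {
public:
    void onOpcode(const std::string& text) { current.opcode = text; }
    void onOperand(const std::string& text) { current.operands.push_back(text); }
    void onEndOfStatement() {
        assert(!current.opcode.empty());  // the grammar only ends real statements
        statements.push_back(current);
        current = Statement();
    }
    std::vector<Statement> statements;

private:
    Statement current;
};

// Lexer stage: commas and semicolons become their own tokens; everything
// else is grouped up to the next whitespace or punctuation character.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string word;
    for (char c : text) {
        bool punct = (c == ',' || c == ';');
        if (punct || std::isspace(static_cast<unsigned char>(c))) {
            if (!word.empty()) {
                tokens.push_back(word);
                word.clear();
            }
            if (punct) tokens.push_back(std::string(1, c));
        } else {
            word += c;
        }
    }
    if (!word.empty()) tokens.push_back(word);
    return tokens;
}

// Parser stage for one toy rule: opcode operand (',' operand)* ';'
void parse(const std::vector<std::string>& tokens, StatementBuilder& builder) {
    bool expectOpcode = true;
    for (const std::string& token : tokens) {
        if (token == ";") {
            if (!expectOpcode) builder.onEndOfStatement();  // skip stray ';'
            expectOpcode = true;
        } else if (token == ",") {
            continue;  // separators carry no information of their own
        } else if (expectOpcode) {
            builder.onOpcode(token);
            expectOpcode = false;
        } else {
            builder.onOperand(token);
        }
    }
}
```

Keeping the statement construction in the builder means the grammar stage only decides what was matched, which is the readability benefit mentioned above.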

If you’re interested in another version of a PTX grammar, I wrote one as well, in Antlr. It is LL as opposed to LR. See http://code.google.com/p/cuda-waste/source…runk/ptxp/Ptx.g. This grammar generates an AST as well, but I haven’t completed the documentation of that. The “PTX: Parallel Thread Execution ISA Version 2.2” doc has no grammar, and the descriptions of the instructions are not that great (and contain errors). So I also had to derive the grammar from compiler output and from hand-written tests that pushed the syntax, which I then tested using ptxas and a CUDA driver program. Antlrworks can be used to view the parse and AST of a PTX source file. --Ken D.

Hi Greg,

So the PTX ISA manual was your starting point… Hmm… Very interesting!! And really awesome, gutsy work!
Hmm… The PTX binary is just an assembled version of the PTX… So… the manual must have been helpful!

By the way, did you ever use “decuda” to understand anything?
I think “decuda” comes after the PTX stage… so it must not have been a great help! Can you confirm?

Thanks for writing back,

Ken,
Thanks for sharing! It's going to be useful! Many thanks!
By the way, I think that is applicable to the “PTX” assembly language in “text” format, right?
I may need to do some conversion before using it… I will check it out. Thanks!

Best Regards,
Sarnath

Ken,

As I was reading through your “cuda-waste” project, I was just wondering if you ever tried compiling “Ocelot” under cygwin. I would guess it must be a minor thing.

Anyway, Good luck on your project!

And if I can take the liberty, may I ask you to consider a better name for your project than cuda-waste?

Thanks,
