Ocelot - Finding the PTX (Cat) inside the executable (Bag) Is Ocelot Dependent on the CUDA version?

Correct, the parser I wrote is a text processor: it reads PTX assembly language and outputs a tree. I don’t look at the machine code. But I don’t know that Ocelot does either; see assertM(binary->ptx->ptx != 0, "binary contains no PTX"); at line 488 of http://code.google.com/p/gpuocelot/source/…CudaRuntime.cpp, r714.

The __cudaFatCudaBinary structure contains a pointer to an array of __cudaFatPtxEntry structures, which contain the PTX assembly language code. (I could be wrong, but all my examples work that way: binary->ptx->ptx is text, not machine code. I also think binary->ptx points to a null-terminated array of __cudaFatPtxEntry, not just one record, because some of my examples work that way, too.) I would assume that if you want the machine code, you’d go down __cudaFatCubinEntry, __cudaFatElfEntry, and/or one of the other members of the __cudaFatCudaBinary structure.

Barra, another emulator, does decode machine code, but it hooks the driver API (cuModuleLoad…), not the runtime API (cudaRegisterFatBinary), so it doesn’t use this structure (as far as I can tell from reading their code). Really, the best way to learn this stuff is to write an API hooking layer, then start debugging example CUDA programs. --Ken D.
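
For anyone who wants to try this, here is a minimal sketch of walking such an array of PTX entries and printing the text. The FatPtxEntry and FatBinary types below are hypothetical stand-ins, not the real __cudaFatPtxEntry and __cudaFatCudaBinary definitions (those have more members and live in NVIDIA's headers). The sketch also assumes the array ends with an entry whose ptx pointer is NULL, which matches the description above but is not confirmed here.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for one PTX entry. */
typedef struct {
    const char *gpuProfileName; /* assumed: profile label such as "compute_10" */
    const char *ptx;            /* PTX assembly as NUL-terminated text */
} FatPtxEntry;

/* Hypothetical, simplified stand-in for the fat binary record. */
typedef struct {
    const FatPtxEntry *ptx;     /* assumed: array ended by an entry with ptx == NULL */
} FatBinary;

/* Count the PTX entries that precede the terminating entry. */
size_t count_ptx_entries(const FatBinary *binary)
{
    size_t n = 0;
    if (binary == NULL || binary->ptx == NULL)
        return 0;
    while (binary->ptx[n].ptx != NULL)
        n++;
    return n;
}

/* Write every PTX text to `out`; returns the number of entries written. */
size_t dump_ptx(const FatBinary *binary, FILE *out)
{
    size_t n = count_ptx_entries(binary);
    for (size_t i = 0; i < n; i++) {
        const FatPtxEntry *e = &binary->ptx[i];
        fprintf(out, "// profile: %s\n%s\n",
                e->gpuProfileName ? e->gpuProfileName : "(unknown)",
                e->ptx);
    }
    return n;
}
```

In a hooking layer you would cast the pointer handed to the registration call to the real structure type and walk binary->ptx the same way; calling dump_ptx(&bin, stdout) on a mock record is enough to see the loop terminate at the NULL entry.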

I have seen that part of the code… I relate it to the “nvcc --cuda xxx.cu” output.

I thought binary->ptx->ptx was machine code… It is not? That’s cool!
But I wonder why NVIDIA would store ASCII PTX in there… It looks weird. But as you rightly say, “cubin” is the binary… That’s probably where the machine code is. Thanks for pointing that out!

But if Ocelot is doing a PTX transformation, it finally has to create a “cubin” for it, at least for the GPU target that they support now. I am very curious to know how they get it done! Any inputs?

Ken,

Thanks! Verified that binary->ptx->ptx is actually “text”. I just wrote a program to print it out!
Thanks a lot for opening my eyes!

Best Regards,
Sarnath

This is correct, except we also have to deal with __cudaFatCudaBinary to implement cuModuleLoadFatBinary.

cudaRegisterFatBinary seems to be mostly a wrapper over cuModuleLoadFatBinary.

We just read binary->cubin->cubin and ignore the PTX. Actually, the cubin used to be ASCII text too, then was switched to ELF with CUDA 3.0. It is the same data as produced by nvcc --cubin, byte for byte.
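
Since the cubin can be either the older ASCII form or ELF, a loader has to tell them apart. A minimal sketch, assuming you already have the cubin bytes and their length in hand: check for the standard 4-byte ELF magic (0x7f 'E' 'L' 'F'). The classify_cubin function and the CubinKind enum are hypothetical names for illustration, and anything that is not ELF is simply reported as "other" rather than validated as text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a cubin blob. */
typedef enum { CUBIN_EMPTY, CUBIN_ELF, CUBIN_OTHER } CubinKind;

/* Report whether `data` starts with the ELF magic number.
 * Non-ELF data (for example the older text format) is CUBIN_OTHER. */
CubinKind classify_cubin(const unsigned char *data, size_t len)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (data == NULL || len == 0)
        return CUBIN_EMPTY;
    if (len >= sizeof elf_magic && memcmp(data, elf_magic, sizeof elf_magic) == 0)
        return CUBIN_ELF;
    return CUBIN_OTHER;
}
```

The length has to be supplied by the caller because an ELF image contains embedded zero bytes, so strlen on binary->cubin->cubin would not be reliable for that case.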

Thanks to all of you for your answers!
