PTX "Assembler": Rolling my own?

I’ve been thinking about this for a while now – is there any documentation available that details how PTX instructions are “assembled” for the device, transferred, and executed?

As a proof of concept, I'm interested in building a lightweight visual assembler for PTX instructions that developers can use to optimize their routines. I'd like the program to be able to assemble the current listing, upload it to the device, and run it. I assume this would involve some undocumented driver functions, so that's probably what I'd need to make this work.

If anyone from NVIDIA is interested in hearing more about this (or wants to help) but can't release the information publicly, please PM me so we can discuss. I plan for this to be a free tool, so I won't be profiting from the information…

Have you seen decuda and cudasm?

Yeah, I looked at them before… I suppose I never put two and two together and realized that I could just compile a .cubin file and use the driver API to load it onto the device (cuModuleLoad()).

The programming reference says that cuModuleLoad() loads the module into the current context… does this mean the code in the module is executed as soon as loading completes?

PTX code is translated into cubin format by ptxas. Using the driver API you can then load and execute functions (kernels) from the cubin; as far as I understand it, anyway, since I haven't had to use the driver API myself yet. It's covered in section 4.5.3 of the beta 2.0 programming guide.
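To answer the question above: no, loading a module does not execute anything. It just copies the code into the context and makes its symbols available; you have to look up a kernel by name and launch it explicitly. A minimal sketch of the flow with the driver API (error checking omitted; the file name "kernel.cubin", kernel name "myKernel", and launch dimensions are placeholders you'd replace with your own):

```c
#include <cuda.h>   /* CUDA driver API header */

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Loading the module only transfers the code and makes its
       symbols available in the context -- nothing runs yet. */
    cuModuleLoad(&mod, "kernel.cubin");

    /* Execution only happens when a kernel is fetched by name
       and launched explicitly. */
    cuModuleGetFunction(&fn, mod, "myKernel");
    cuFuncSetBlockShape(fn, 256, 1, 1);  /* threads per block */
    cuLaunchGrid(fn, 4, 1);              /* 4 x 1 grid of blocks */

    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}
```

This needs a CUDA-capable machine to actually run, of course, but it shows that the load step and the launch step are completely separate.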

Thanks for the replies, all… I think I'm going to wrap cudasm for the time being (pass the text from the program to it and compile a cubin file), then use the driver API to load and execute the module (see section 4.5.3 of the programming guide).

Probably the best reference for PTX is compiling with nvcc --keep and playing around with the assembly code nvcc generates, since the PTX ISA document is a bit out of sync with what ptxas actually accepts (it still claims vector and swizzle operations). PM me if you want to bounce ideas; I've dug through a lot of the internals with regard to workarounds.

It’s actually not necessary to use the Driver API.

The device code repository is a neat feature: all CUDA executables automatically search a special folder in their working directory that may contain updated versions of their kernels, in either cubin or PTX format.

Well, I was thinking about doing this as a sort of assembler/debugger (if possible). I was hoping to take the PTX instructions and run them directly on the device via the driver API… and now that I've found the right driver functions, I can do that without having to create an executable.

If I can get this program working like I want it to, it may be rolled into a larger project that I have some ideas for…