Self-modifying code

Is there, or will there be, a way to write self-modifying code in CUDA?
It could help reduce register usage and thereby increase the number of threads that can run concurrently.

I’d like to add a related follow-up question. Is there likely to be a way to dynamically generate code?

I am very interested in the new work being done by Alan Kay and many of the other Smalltalk luminaries, along with some newer people like Ian Piumarta. Anyone interested in truly amazing software should have a look at the Viewpoints Research Institute, and especially its Writings publications page.

Specifically, the paper Steps Toward the Reinvention of Programming by A. Kay, I. Piumarta, K. Rose, D. Ingalls, D. Amelang, T. Kaehler, Y. Ohshima, C. Thacker, S. Wallace, A. Warth, T. Yamamiya.

In a couple of projects (JitBlt, Gezira), they are writing powerful (admittedly 2D) graphics systems in only a few hundred lines of code, and getting good performance by JIT-compiling to machine code. I believe the technique has been migrated back into Cairo because it generates very fast rendering code. Very cool.

The CUDA analogue is dynamically generating the code for a kernel for a specific data set. I am thinking that I don’t want to generate CUDA C and go via nvopencc, but instead generate a form of PTX into a memory buffer and hand that over to the driver to handle.

I’d like to know if/when we’ll be able to do direct, fast PTX generation on the host side for immediate loading and execution on the GPU.


I don’t see any reason why you couldn’t generate PTX on the host side, compile it into a shared library of some sort using nvcc/ptxas, and then use dlopen()/dlsym()/dlclose() (Unix) or LoadLibrary() etc. (Windows) to use the dynamically generated code. People have been doing things like this for years as a somewhat portable means of generating native machine code. It’s not as fast as banging on the metal and writing to the text segment while code is running, but that approach is becoming more difficult as modern operating systems have begun to frown on writable, executable memory. Generating PTX, linking it into a shared library, and then loading it won’t be very fast, but you can at least prototype your code if it’s that important to you.


Yes, I’ve done this… the generate-CUDA-code-on-the-fly thing, that is. It’s not very different from generating shaders on the fly, as is usual in graphics programming. Just print out PTX or C code to a string and feed it to the compiler, then load it using the CUDA API.
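As a rough sketch of the load step with the driver API (not verified against any particular CUDA version): cuModuleLoadData() accepts a NUL-terminated PTX image in memory as well as a cubin image. The PTX fragment and the kernel name noop below are made up for the example.

```c
/* Sketch: loading generated PTX from an in-memory string (driver API).
 * The PTX text and the name "noop" are illustrative, not tested. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* In a real generator this string would be assembled at run time. */
    const char *ptx =
        ".version 1.4\n"
        ".target sm_10\n"
        ".entry noop\n"
        "{\n"
        "    ret;\n"
        "}\n";

    /* The 'image' parameter points at the in-memory PTX (or cubin) image. */
    if (cuModuleLoadData(&mod, ptx) != CUDA_SUCCESS) {
        fprintf(stderr, "cuModuleLoadData failed\n");
        return 1;
    }
    cuModuleGetFunction(&fn, mod, "noop");
    cuFuncSetBlockShape(fn, 1, 1, 1);   /* old-style driver-API launch */
    cuLaunchGrid(fn, 1, 1);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

Since no compiler or linker is invoked, this path avoids the shared-library round trip entirely; the driver JIT-compiles the PTX at load time.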

I’m quite sure self-modification is not possible. CUDA programs can read the memory where their code is hosted, but writing to it doesn’t do anything. (Also, remember that CUDA is meant to be very parallel, and all blocks execute the same code, so self-modification would always be a mess.)

Wumpus, thanks, that sounds right.

I assume this works by creating a cubin ‘file’ on the fly, then calling cuModuleLoad() or cuModuleLoadData()?

Can you explain what cuModuleLoadData() does? I don’t get the documentation, and I can’t find an example or any mention on the forums (or via Google), but it looks like I can pass a faked-up cubin file as a string in the ‘image’ parameter.

I would want this to be pretty quick (milliseconds). Is the method you’ve tried in that ballpark?



I guess you mean using cuModuleLoad(), cuLaunchGrid(), etc. Is it possible to mix these driver API functions and kernel launches with my original code that uses the runtime API, or do I have to rewrite all of my code to use the driver API only?


I think I read somewhere that you cannot mix the two.

See Section 4.5, Host Runtime Component, of the programming guide.