Dynamic CodeGen inside kernel using Inline PTX

Hi,
I know this is quite a complex question, but I was wondering whether there is any way to generate assembly dynamically during kernel execution, to be executed later within the same thread.

The motivation behind this is that I want to execute huge numbers of different programs extremely efficiently (for an AI framework), but each program's logic is different. I first tried compiling a separate kernel for each program, but the compile times were far larger than the execution times, so most of the time the GPU sat idle.

Another option was a single kernel that loads an int array from global memory whose values encode the program logic: for instance, the value 0 means "sum", 1 means "multiply", and so on. The problem was the massive overhead of reading and comparing values from memory on every operation, which resulted in poor performance.
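
Concretely, the interpreter approach looked roughly like this (the opcodes, operations, and names here are simplified hypothetical stand-ins, not my actual framework):

```
// Interpreter-style kernel: every thread walks an opcode stream stored in
// global memory and dispatches on each value. The per-opcode memory read
// and branch in the switch is exactly the overhead I am trying to avoid.
enum Op { OP_ADD = 0, OP_MUL = 1, OP_HALT = 2 };

__global__ void interpret(const int *program, const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = in[tid];
    for (int pc = 0; ; ++pc) {
        switch (program[pc]) {          // load + compare per instruction
            case OP_ADD:  acc += 1.0f; break;
            case OP_MUL:  acc *= 2.0f; break;
            case OP_HALT: out[tid] = acc; return;
        }
    }
}
```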

So my hoped-for solution was to use inline PTX to dynamically create the program's instructions without recompiling anything.

Something like this:

```
__global__
void computeCustomGraph(inputs...)
{
    // using inline PTX, create some instructions and store them
    // in an array or something similar...
    int instructions[1024] = ...;
    for (int i = 0; i < 10000; i++) {
        // execute the instructions created previously
    }
}
```

But the issue is that I'm not sure what kind of instruction could be used to add more instructions to the execution stream, or how to run instructions stored in an array.

Any ideas?

PTX is only an intermediate stage in the compilation process; it is in turn processed by the assembler to generate the machine code the GPU requires. So, short of writing your own in-kernel assembler, this approach won't work.

I’m sure there are other complications as well, even if you were able to do this.

That is not supported. Note that PTX is not what is executed by the GPU. PTX is a virtual ISA and intermediate compiler representation, and has to be compiled to machine code (SASS) for a particular GPU architecture. This compilation can be performed either offline with the ptxas component of the toolchain (nvcc invokes it under the hood), or online using the JIT-compiler component baked into the CUDA driver.
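
Inline PTX does not change this picture, by the way: the PTX text inside an asm() statement is fixed when the kernel is compiled and goes through ptxas like everything else, so it cannot construct new instructions at runtime. For illustration:

```
// The PTX string below is baked in at compile time and assembled by
// ptxas together with the rest of the kernel; the opcode cannot be
// chosen or rewritten while the kernel is running.
__global__ void fixedInlinePtx(float *x)
{
    float v = x[threadIdx.x];
    asm("add.f32 %0, %1, %2;" : "=f"(v) : "f"(v), "f"(1.0f));
    x[threadIdx.x] = v;
}
```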

The closest you can get is to dynamically generate PTX code in host code, and then use the CUDA driver interface to JIT-compile and load the resulting binary. I am aware of several apps that use this approach to generate code on the fly in response to user inputs. Whether this approach is suitable and performant enough for your use case, I cannot venture to guess.
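
As a rough sketch of what that might look like (the PTX template, kernel name, and opcode choice here are illustrative assumptions, not a fixed recipe): the host builds a PTX string at runtime, splicing in the desired operation, and cuModuleLoadData hands it to the driver's JIT:

```
// Minimal sketch: runtime PTX generation + driver-side JIT compilation.
// Requires linking against the CUDA driver library (-lcuda).
#include <cuda.h>
#include <cstdio>
#include <string>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) {   \
    const char *msg; cuGetErrorString(r, &msg);                          \
    printf("CUDA error: %s\n", msg); return 1; } } while (0)

int main()
{
    // Opcode chosen dynamically; imagine it comes from the program graph.
    std::string op = "add";   // or "mul", etc.
    std::string ptx =
        ".version 7.0\n"
        ".target sm_50\n"
        ".address_size 64\n"
        ".visible .entry apply(.param .u64 p)\n"
        "{\n"
        "  .reg .u64 %rd<2>;\n"
        "  .reg .f32 %f<3>;\n"
        "  ld.param.u64 %rd1, [p];\n"
        "  cvta.to.global.u64 %rd1, %rd1;\n"
        "  ld.global.f32 %f1, [%rd1];\n"
        "  " + op + ".f32 %f2, %f1, 0f40000000;\n"   // operate with 2.0f
        "  st.global.f32 [%rd1], %f2;\n"
        "  ret;\n"
        "}\n";

    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // The driver JIT-compiles the PTX string to SASS at load time.
    CUmodule mod;  CHECK(cuModuleLoadData(&mod, ptx.c_str()));
    CUfunction fn; CHECK(cuModuleGetFunction(&fn, mod, "apply"));

    float host = 3.0f;
    CUdeviceptr dptr; CHECK(cuMemAlloc(&dptr, sizeof(float)));
    CHECK(cuMemcpyHtoD(dptr, &host, sizeof(float)));

    void *args[] = { &dptr };
    CHECK(cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, args, nullptr));
    CHECK(cuCtxSynchronize());

    CHECK(cuMemcpyDtoH(&host, dptr, sizeof(float)));
    printf("result: %f\n", host);   // 5.0 for "add", 6.0 for "mul"

    cuMemFree(dptr);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

If many programs repeat, caching the JIT-compiled modules keyed by the generated PTX string avoids paying the compilation cost more than once per distinct program.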