Dynamic CodeGen inside kernel using Inline PTX

Hi,
I know this is quite a complex question, but I was wondering whether there is any way to generate assembly dynamically during kernel execution, to be executed later within the same thread.

The motivation behind this is that I want to execute huge numbers of different programs extremely efficiently (for an AI framework), but each program's logic is different. I first tried compiling a separate kernel for each program, but the compile times were far larger than the execution times, so most of the time the GPU sat idle.

Another option was a single kernel that loads an int array from global memory whose values encode the program logic: for instance, the value 0 means "sum", 1 means "multiply", and so on. The problem was the massive overhead of reading and comparing values from memory on every operation, which resulted in poor performance.
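
Concretely, the interpreter approach looked roughly like this (the opcodes, operations, and names here are simplified hypothetical stand-ins, not my actual framework):

```
// Interpreter-style kernel: every thread walks an opcode stream stored in
// global memory and dispatches on each value. The per-opcode memory read
// and branch in the switch is exactly the overhead I am trying to avoid.
enum Op { OP_ADD = 0, OP_MUL = 1, OP_HALT = 2 };

__global__ void interpret(const int *program, const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = in[tid];
    for (int pc = 0; ; ++pc) {
        switch (program[pc]) {          // load + compare per instruction
            case OP_ADD:  acc += 1.0f; break;
            case OP_MUL:  acc *= 2.0f; break;
            case OP_HALT: out[tid] = acc; return;
        }
    }
}
```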

So my hoped-for solution was to use inline PTX to dynamically create the program's instructions without recompiling anything.

Something like this:

```
__global__
void computeCustomGraph(inputs...)
{
    // using inline PTX, create some instructions and store them
    // in an array or something similar...
    int instructions[1024] = ...;
    for (int i = 0; i < 10000; i++) {
        // execute the instructions created previously
    }
}
```

But the issue is that I'm not sure what kind of instruction could be used to add more instructions to the execution stream, or how to run instructions stored in an array.

Any ideas?

PTX is only an intermediate stage in the compilation process; it is in turn processed by the assembler to generate the machine code the GPU requires. So, short of writing your own in-kernel assembler, this approach won't work.

I’m sure there are other complications as well, even if you were able to do this.

That is not supported. Note that PTX is not what is executed by the GPU. PTX is a virtual ISA and intermediate compiler representation, and has to be compiled to machine code (SASS) for a particular GPU architecture. This compilation can be performed either offline with the ptxas component of the toolchain (nvcc invokes it under the hood), or online using the JIT-compiler component baked into the CUDA driver.
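
Inline PTX does not change this picture, by the way: the PTX text inside an asm() statement is fixed when the kernel is compiled and goes through ptxas like everything else, so it cannot construct new instructions at runtime. For illustration:

```
// The PTX string below is baked in at compile time and assembled by
// ptxas together with the rest of the kernel; the opcode cannot be
// chosen or rewritten while the kernel is running.
__global__ void fixedInlinePtx(float *x)
{
    float v = x[threadIdx.x];
    asm("add.f32 %0, %1, %2;" : "=f"(v) : "f"(v), "f"(1.0f));
    x[threadIdx.x] = v;
}
```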

The closest you can get is to dynamically generate PTX code in host code, and then use the CUDA driver interface to JIT-compile and load the resulting binary. I am aware of several apps that use this approach to generate code on the fly in response to user inputs. Whether this approach is suitable and performant enough for your use case, I cannot venture to guess.
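
As a rough sketch of what that might look like (the PTX template, kernel name, and opcode choice here are illustrative assumptions, not a fixed recipe): the host builds a PTX string at runtime, splicing in the desired operation, and cuModuleLoadData hands it to the driver's JIT:

```
// Minimal sketch: runtime PTX generation + driver-side JIT compilation.
// Requires linking against the CUDA driver library (-lcuda).
#include <cuda.h>
#include <cstdio>
#include <string>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) {   \
    const char *msg; cuGetErrorString(r, &msg);                          \
    printf("CUDA error: %s\n", msg); return 1; } } while (0)

int main()
{
    // Opcode chosen dynamically; imagine it comes from the program graph.
    std::string op = "add";   // or "mul", etc.
    std::string ptx =
        ".version 7.0\n"
        ".target sm_50\n"
        ".address_size 64\n"
        ".visible .entry apply(.param .u64 p)\n"
        "{\n"
        "  .reg .u64 %rd<2>;\n"
        "  .reg .f32 %f<3>;\n"
        "  ld.param.u64 %rd1, [p];\n"
        "  cvta.to.global.u64 %rd1, %rd1;\n"
        "  ld.global.f32 %f1, [%rd1];\n"
        "  " + op + ".f32 %f2, %f1, 0f40000000;\n"   // operate with 2.0f
        "  st.global.f32 [%rd1], %f2;\n"
        "  ret;\n"
        "}\n";

    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // The driver JIT-compiles the PTX string to SASS at load time.
    CUmodule mod;  CHECK(cuModuleLoadData(&mod, ptx.c_str()));
    CUfunction fn; CHECK(cuModuleGetFunction(&fn, mod, "apply"));

    float host = 3.0f;
    CUdeviceptr dptr; CHECK(cuMemAlloc(&dptr, sizeof(float)));
    CHECK(cuMemcpyHtoD(dptr, &host, sizeof(float)));

    void *args[] = { &dptr };
    CHECK(cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, args, nullptr));
    CHECK(cuCtxSynchronize());

    CHECK(cuMemcpyDtoH(&host, dptr, sizeof(float)));
    printf("result: %f\n", host);   // 5.0 for "add", 6.0 for "mul"

    cuMemFree(dptr);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

If many programs repeat, caching the JIT-compiled modules keyed by the generated PTX string avoids paying the compilation cost more than once per distinct program.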