Hi,
I know this is quite a complex question, but I was wondering if there is any way to generate asm dynamically during kernel execution, to be executed later within the same thread.
The motivation is that I want to execute a huge number of different programs extremely efficiently (for an AI framework), and each program has different logic. I first tried compiling a separate kernel for each program, but the compile times were far longer than the execution times, so the GPU sat idle most of the time.
Another option was a single kernel that loads an int array from global memory whose values encode the program logic (for instance, 0 means “sum”, 1 means “multiply”, and so on). The problem there was the massive overhead of reading and comparing those values from memory, which resulted in poor performance.
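To make the overhead concrete, here is a minimal sketch of that interpreter approach. It is not my real code: the names (`interpretProgram`, `runProgram`, `OP_SUM`, `OP_MUL`) and the single-accumulator semantics are made up for illustration, and only the encoding 0 = sum, 1 = multiply comes from the description above.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Opcode encoding from the description above: 0 = sum, 1 = multiply.
// The enum names themselves are hypothetical.
enum Op : int { OP_SUM = 0, OP_MUL = 1 };

// Each thread interprets the whole opcode array against its own accumulator.
// Assumed semantics for this sketch: the operand is always the thread's input.
__global__ void interpretProgram(const int* ops, int numOps,
                                 const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float x = in[tid];
    float acc = x;
    for (int pc = 0; pc < numOps; ++pc) {
        // Every step costs a global-memory load of the opcode plus a branch.
        switch (ops[pc]) {
            case OP_SUM: acc += x; break;
            case OP_MUL: acc *= x; break;
            default: break;  // unknown opcodes are ignored in this sketch
        }
    }
    out[tid] = acc;
}

// Hypothetical host helper: uploads the program and inputs, runs one thread
// per input element, and returns the results.
std::vector<float> runProgram(const std::vector<int>& ops,
                              const std::vector<float>& in)
{
    const int n = static_cast<int>(in.size());
    const int numOps = static_cast<int>(ops.size());

    int* dOps = nullptr;
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dOps, numOps * sizeof(int));
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dOps, ops.data(), numOps * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    interpretProgram<<<blocks, threads>>>(dOps, numOps, dIn, dOut, n);

    // The blocking copy also surfaces any error from the kernel launch.
    std::vector<float> out(n);
    cudaError_t status = cudaMemcpy(out.data(), dOut, n * sizeof(float),
                                    cudaMemcpyDeviceToHost);
    assert(status == cudaSuccess);

    cudaFree(dOps);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Calling `runProgram({OP_SUM, OP_MUL}, inputs)` changes the program without recompiling anything, which is the property I want to keep, but the per-instruction load and `switch` are where the time goes.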
So my idea is to use inline PTX to create the program's instructions dynamically, without recompiling anything.
Something like this:
```
__global__
void computeCustomGraph(inputs...)
{
    // using inline PTX, create some instructions and store them in an array or something similar...
    int instructions[1024] = ...
    for (int i = 0; i < 10000; i++) {
        // execute the instructions created previously
    }
}
```
But the issue is that I'm not sure what kind of instruction could be used to add more instructions to the execution, or how to run instructions stored in an array.
Any ideas?