Dynamic Kernel Function Runtime code generation

Hi, I would like to create a kernel function dynamically and update its contents from time to time…

Initially my application will run with the following kernel function

__global__ void MY_Kernel(float Param1, float Param2)
{ Param1 = A + B * C / Param2; }

Later it will be

__global__ void MY_Kernel(float Param1, float Param2)
{ Param1 = sqrt(E) / B * C + Param2; }


The function name will always be the same… The parameter types and count will be the same… Only the body of the function will be defined by the user at runtime…

Is it possible? What are the steps to implement such an application?

OK, I can somehow generate a PTX file with the help of the PTX…PDF file. But how do I load it onto the GPU dynamically at runtime?

Use the driver API instead of the runtime API. See the documentation in the programming guide. I’ve never used the driver API, but I’m fairly sure it can do what you need done.

First, you should get your code compiled into a .cubin file. You are not required to produce a .ptx file; you can use ordinary .cu files. You can compile a .cu file with “nvcc -cubin”, or a .ptx file with ptxas.

Next, you need to load the .cubin file onto the GPU. This can be done with the low-level CUDA Driver API (see the Programming Manual): cuModuleLoad() will load device code from your newly generated .cubin file.

You’re probably using the Runtime API, so you will have to rewrite your program using the Driver API (see 4.5.3 in the Programming Manual).
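The generate-and-compile part of the steps above could look roughly like this on the host. It is only a sketch: the helper names (make_kernel_source, write_text_file, nvcc_cubin_command) and the file names are made up for illustration, and the variables A, B, C, E from the original question would still have to be declared somewhere in the generated file. The extern "C" is there so the kernel keeps the unmangled name MY_Kernel when it is looked up in the loaded module. Loading the resulting .cubin with cuModuleLoad() is not shown.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Build the text of a .cu file whose kernel body is a user-supplied statement.
// The signature is fixed; only `body` changes between generations.
std::string make_kernel_source(const std::string& body) {
    assert(!body.empty());
    return "extern \"C\" __global__ void MY_Kernel(float Param1, float Param2)\n"
           "{\n"
           "    " + body + "\n"
           "}\n";
}

// Write the generated source to disk; returns false if the file could not be written.
bool write_text_file(const std::string& path, const std::string& text) {
    std::ofstream out(path.c_str());
    if (!out) return false;
    out << text;
    return static_cast<bool>(out);
}

// Compose the compiler command line mentioned above ("nvcc -cubin").
// Assumption: paths contain no spaces, so no quoting is done here.
std::string nvcc_cubin_command(const std::string& cu_path,
                               const std::string& cubin_path) {
    return "nvcc -cubin -o " + cubin_path + " " + cu_path;
}
```

A caller would pass the user's statement to make_kernel_source(), save it with write_text_file(), run the string from nvcc_cubin_command() with std::system(), and then hand the .cubin path to cuModuleLoad().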

What about using the “call” PTX instruction? The PTX_ISA_1.0 guide explicitly allows a call to an address stored in a register:

“The called location can be either a symbolic function name or an address held in a

So it should be possible to generate code dynamically and then jump to it through a pointer?

Or have I misunderstood something?


I guess ‘call’ instructions are inlined at compile time. GPUs do not have a stack (so functions do not know their return address), and I’m not sure whether GPUs have anything like an instruction pointer (i.e. whether there is such a thing as a ‘pointer to code’).

And keep in mind that PTX operates on ‘virtual’ registers, not real ones.

Actually, the G80 has an internal stack. It’s only used for return addresses, and you cannot use it for parameters or function-local variables. But the real call instruction exists. It’s never produced by nvcc (afaik), but it is produced by ptxas internally if you use rare instructions like div.u32, div.s32.
(these seem to be pre-made microcode appended to your kernel)

I don’t think you can call a dynamic address (a “call to a register”); I’ve only seen static call instructions in cubin files.

To generate PTX code on the fly and compile it (i.e., runtime compilation) you don’t need dynamic call instructions though :) The CUDA runtime API allows for this without any special magic. There is a method to load a cubin module dynamically, a method to look up function objects in this module, and you can invoke kernels by function object. All on the host.

Does that mean I can load a separate .cubin file for each thread, so that I can execute different code on each thread?

I’m working on a virtual machine for a scripting language which runs on the GPU. Right now the opcodes are just interpreted, and it would be great if the “call” instruction could jump directly to an address containing PTX instructions.

The PTX spec says that the call instruction stores the address of the next instruction, so that execution can resume at that point after the RET instruction.


Due to the nature of CUDA you cannot execute different code for each thread, maybe for each warp of 32 threads, or each block, but not each thread.

And you can only run 1 kernel at a time on the GPU. Any different operations among different warps must be done via branching.

Agreed, only one kernel. But what happens if the PTX “call” instruction is used in a single thread?

e.g. pseudocode:

__global__ void kernel(void) {
    switch (tid) {
        case 1: ptx.call addr_1; break;
        case 2: ptx.call addr_2; break;
        /* ... */
        case n: ptx.call addr_n; break;
    }
}


The current warp size is 32. That means that 32 threads of one block execute concurrently, or at least nearly concurrently.

But what if all the threads in a block follow different paths? Do they all need to be serialized?


Yes. Threads in a warp that follow different branches are serialized in the hardware.
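To put rough numbers on that, here is a small host-side model. It is a deliberate simplification and not a description of what the hardware really does: it assumes a warp needs one pass per distinct branch taken by its threads, and the function name total_passes is made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// branch_of_thread[i] is the id of the branch that thread i takes.
// Simplified model: threads are grouped into warps of `warp_size`, and each
// warp costs one pass per distinct branch id that appears among its threads.
std::size_t total_passes(const std::vector<int>& branch_of_thread,
                         std::size_t warp_size = 32) {
    assert(warp_size > 0);
    std::size_t total = 0;
    for (std::size_t start = 0; start < branch_of_thread.size(); start += warp_size) {
        const std::size_t end = std::min(start + warp_size, branch_of_thread.size());
        std::set<int> distinct(
            branch_of_thread.begin() + static_cast<std::ptrdiff_t>(start),
            branch_of_thread.begin() + static_cast<std::ptrdiff_t>(end));
        total += distinct.size();
    }
    return total;
}
```

Under this model a warp in which all 32 threads take the same branch costs one pass, while the switch (tid) dispatch proposed above, where every thread takes its own case, costs 32 passes per warp.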

And there’s (as far as I know) no benefit to using switch instead of old-fashioned if/else statements in CUDA. Unlike on x86 architectures it is not implemented using a jump table.

Hmm, what exactly does serialization mean? Does it mean that the first thread that branches has to finish before the next one can continue? Or, when the first thread accesses e.g. global memory (I read something here about 600 clocks of latency), can the next thread execute in the meantime?

Guys, I’m really confused about the term “parallel” and the whole thread, block, warp thing. I have read a lot here in the forum but I am still not 100% sure about all the details.


Thank you for the decuda tool. It really helps in understanding things which are simply not documented!

Warps with threads that follow different branches are called “divergent warps”. Read the programming guide and you will know as much as any of us. Nobody except NVIDIA engineers really knows how the hardware handles the situation.

From my experimentation, I can add that “a little” divergence doesn’t really seem to change the performance. i.e. a simple if (2-way divergence) or threads that loop for different numbers of iterations. However the hardware handles these situations, it does a very good job.

What you are proposing seems to be full 32-way divergence for every single warp, which I doubt could possibly be done efficiently.

“Parallel” means many things on the GPU. Don’t worry about warps or blocks at first when trying to wrap your head around a GPU algorithm, just start with the threads. GPUs implement a data-parallel paradigm, which means you perform exactly the same set of instructions on a large number of data elements (tens of thousands to millions or more). You need to imagine that EVERY SINGLE thread is being calculated at exactly the same time. If you can cast your algorithm into this form, it should work very well on the GPU.

After you see this, then all of the details of blocks and warps, the interleaved execution for data latency hiding, etc… all just become implementation details. They are sometimes important for performance reasons (especially memory access patterns), but the same basic picture remains. A massive number of independent, but identical, calculations are being performed each on different data values.
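As a rough picture of that model, here is a plain C++ emulation (no GPU involved). The names thread_body and launch and the arithmetic are invented for the example; the loop only stands in for what the GPU would do across many threads at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What ONE hypothetical thread does: the instructions are identical for every
// thread, only the data element selected by `tid` differs.
void thread_body(std::size_t tid,
                 const std::vector<float>& a,
                 const std::vector<float>& b,
                 std::vector<float>& out) {
    out[tid] = a[tid] + 2.0f * b[tid];
}

// Stand-in for a kernel launch: run the same body once per thread id.
// On a GPU these iterations are independent and can run concurrently.
void launch(std::size_t n_threads,
            const std::vector<float>& a,
            const std::vector<float>& b,
            std::vector<float>& out) {
    assert(a.size() >= n_threads && b.size() >= n_threads && out.size() >= n_threads);
    for (std::size_t tid = 0; tid < n_threads; ++tid) {
        thread_body(tid, a, b, out);
    }
}
```

Because no iteration reads anything another iteration writes, the order in which the thread ids are processed does not matter, which is what lets the hardware schedule them in warps and blocks however it likes.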

What about passing class-type objects as a parameter of a kernel function?

Is that possible in CUDA?

Maybe that’s what the topic starter can use.

I mean something like:

class myclass {
public:
    void f() {…}
};

__global__ void mykernel(myclass* mc) { }

int main(…) {
    myclass mc;
    return 0;
}

This is what I want to do… But you mean the “driver API”, not the runtime API above, right? I have only found driver API functions for dealing with modules and runtime generated code.

Does this mean that I have to convert all my runtime API code into driver API code in order to use the module functions, or is it safe to combine the two APIs in this case? (I know the manual says they shouldn’t be mixed.)


A simple switch-case approach may not be useful in most cases.

To be able to do truly dynamic computations, the kernel may need changes (maybe from the user).
Like DirectX and OpenGL, where the shaders can be compiled and then run on the GPU, compute kernels could also be dynamically generated and compiled under CUDA.

OpenCL looks really promising to me in this regard. It supports dynamic kernel input and compilation/execution on the fly.

Hi! I am writing an AI program (it is my diploma work) that generates and runs many GPGPU programs at runtime. The generated AI code depends on the user’s parameters, which are fragments of C or C++ code. Once the code has been written out, it has to be compiled with the CUDA compiler and run on the GPU. I currently use OpenCL (because of its online compilation), but the AMD video cards are very slow, even the new ones (are they too lazy, or just not talented enough, to speed up their OpenCL code?), so I am using an NVIDIA card, and my code does not need to run on other platforms. (Online compilation, the first OpenCL instruction, and double-precision floating-point code are all very fast on the GeForce cards.) CUDA would be the best fit for me, but without online compilation I cannot use it for my project. Is there any way to do online compilation with CUDA? Or will there at some point be functions, DLLs, or libs for CUDA online compilation? Sorry for my English.