Dynamic Kernel Function Runtime code generation

Hi, I would like to dynamically create a kernel function and update its contents from time to time…

i.e.
Initially my application will run with the following kernel function:

__global__ void MY_Kernel(float Param1, float Param2)
{
    Param1 = A + B * C / Param2;
}

Later it will be

__global__ void MY_Kernel(float Param1, float Param2)
{
    Param1 = sqrt(E) / B * C + Param2;
}

etc…

The function name will always be the same… The parameter types and count will be the same… Only the body of the function will be defined by the user at runtime…

Is it possible? What are the steps to implement such an application?

Okay, somehow I can generate a PTX file with the help of the PTX ISA PDF. But how do I dynamically load it onto the GPU at runtime?

Use the driver API instead of the runtime API. See the documentation in the programming guide. I’ve never used the driver API, but I’m fairly sure it can do what you need done.

First, you should get your code compiled into a .cubin file. It is not required to produce a .ptx file; you can use normal .cu files. You can compile a .cu file with “nvcc -cubin”, or a .ptx file with ptxas.

Next, you need to load the .cubin file onto the GPU. This can be done with the low-level CUDA Driver API (see the Programming Manual): cuModuleLoad() will load device code from your newly generated .cubin file.

You’re probably using the Runtime API, so you will have to rewrite your program using the Driver API (see 4.5.3 in the Programming Manual).
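For example, loading and launching looks roughly like this (just a sketch: it assumes your kernel was compiled into a hypothetical my_kernel.cubin, was declared extern "C" so its name is not mangled, and uses cuLaunchKernel, which needs a recent toolkit; error checking is omitted):

[codebox]#include <cuda.h>

int main()
{
    // Initialize the driver API and create a context on device 0.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load the freshly generated .cubin and look up the kernel by name.
    CUmodule mod;
    cuModuleLoad(&mod, "my_kernel.cubin");
    CUfunction func;
    cuModuleGetFunction(&func, mod, "MY_Kernel");

    // Set up the two float parameters and launch a single thread.
    float param1 = 0.0f, param2 = 2.0f;
    void* args[] = { &param1, &param2 };
    cuLaunchKernel(func, 1, 1, 1,   // grid dimensions
                   1, 1, 1,         // block dimensions
                   0, NULL,         // shared memory bytes, stream
                   args, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}[/codebox]

Whenever the user changes the function body, you regenerate and recompile the .cu (or .ptx), unload the old module with cuModuleUnload(), and load the new one the same way.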

What about using the “call” PTX instruction? The PTX_ISA_1.0 guide explicitly allows calling an address which is stored in a register:

“The called location can be either a symbolic function name or an address held in a
register.”

So it should be possible to dynamically generate code and then jump to the position of this code via a pointer?

Or did I misunderstand something?

regards,
jj

I guess ‘call’ instructions are inlined at compile time. The GPU does not have a stack (so functions do not know their return address), and I’m not sure whether GPUs have something like an instruction pointer (i.e., whether there is such a thing as a ‘pointer to code’).

And keep in mind that PTX operates on ‘virtual’ registers, not real ones.

Actually, the G80 has an internal stack. It’s only used for return addresses, and you cannot use it for parameters or function-local variables. But a real call instruction does exist. It’s never produced by nvcc (afaik), but it is produced by ptxas internally if you use rare instructions like div.u32 or div.s32 (these seem to be pre-made microcode appended to your kernel).

I don’t think you can call to a dynamic address (“call to a register”); I’ve only seen static call instructions in cubins.

To generate PTX code on the fly and compile it (i.e., runtime compilation) you don’t need dynamic call instructions though :) The CUDA runtime API allows for this without any special magic. There is a method to load a cubin module dynamically, a method to look up function objects in this module, and you can invoke kernels by function object. All on the host.
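And if the generated PTX already sits in a string in memory, you can hand it to the module loader directly instead of going through a file, something like this (a rough sketch using the cuModule* functions mentioned above; the helper name and parameters are mine, and error checking is omitted):

[codebox]#include <cuda.h>
#include <string>

// Takes PTX text built at runtime (e.g. from user input), lets the driver
// JIT-compile it for the device in the current context, and returns the
// kernel handle so it can be launched like any other function.
CUfunction compilePtxAndGetKernel(const std::string& ptxSource,
                                  const char* kernelName)
{
    CUmodule mod;
    // cuModuleLoadData accepts PTX text as well as cubin images.
    cuModuleLoadData(&mod, ptxSource.c_str());

    CUfunction func;
    cuModuleGetFunction(&func, mod, kernelName);
    return func;
}[/codebox]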

Does it mean that I’m able to load .cubin files for each thread, so I can execute different code on each thread?

I’m working on a virtual machine for a scripting language which runs on the GPU. Right now the opcodes are just interpreted, and it would be great if the “call” instruction could invoke a direct address from PTX instructions.

The PTX spec describes that the call instruction stores the address of the next instruction, and that execution can resume at that point after the RET instruction.

regards,
jj

Due to the nature of CUDA you cannot execute different code for each thread; maybe for each warp of 32 threads, or for each block, but not for each thread.

And you can only run one kernel at a time on the GPU. Any different operations among different warps must be done via branching.
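For example, something like this keeps all 32 threads of a warp on the same path, so warps never diverge internally (just a toy sketch; 32 is the current warp size):

[codebox]// Each warp of 32 threads picks one path; threads inside a warp never diverge.
__global__ void perWarpWork(float* data)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;   // warp index within the block

    if (warp == 0)
        data[tid] *= 2.0f;         // every thread of warp 0 does this
    else
        data[tid] += 1.0f;         // every thread of the other warps does this
}[/codebox]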

Agreed, only one kernel. But if you use the PTX “call” instruction in a single thread, what will happen?

e.g. pseudo code:

__global__ void kernel(void)
{
    switch (tid) {
        case 1: ptx.call addr_1; break;
        case 2: ptx.call addr_2; break;
        ...
        case n: ptx.call addr_n; break;
    }
}

The current warp size is 32. That means that 32 threads of one block execute concurrently, or at least nearly concurrently.

But what if all the threads in a block follow different paths? Do they all need to be serialized?

regards,
jj

Yes. Threads in a warp that follow different branches are serialized in the hardware.

And there’s (as far as I know) no benefit to using switch instead of old-fashioned if/else statements in CUDA. Unlike on x86 architectures, it is not implemented using a jump table.
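For example, something like this diverges inside every warp, because even and odd threads of the same warp take different branches (just a toy sketch):

[codebox]// Even and odd threads within the same warp take different paths,
// so the hardware executes the two branches one after the other.
__global__ void divergentKernel(float* data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid % 2 == 0)
        data[tid] *= 2.0f;   // half of each warp
    else
        data[tid] += 1.0f;   // the other half, executed afterwards
}[/codebox]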

Hmm, what exactly does serialization mean? Does it mean that the first thread which branches needs to finish before the next can continue, or, when the first thread accesses e.g. global memory (I read here something about 600 clocks of latency), can the next thread execute in the meantime?

Guys, I’m really confused about the term “parallel” and the whole thread/block/warp thing. I’ve read a lot here in the forum but I’m still not 100% sure about all the details.

@wumpus

Thank you for the decuda tool. It really helps in understanding things which are simply not documented!

Warps with threads that follow different branches are called “divergent warps”. Read the programming guide and you will know as much as any of us. Nobody except NVIDIA engineers really knows how the hardware handles the situation.

From my experimentation, I can add that “a little” divergence doesn’t really seem to change the performance. i.e. a simple if (2-way divergence) or threads that loop for different numbers of iterations. However the hardware handles these situations, it does a very good job.

What you are proposing seems to be full 32-way divergence for every single warp which I doubt could possibly be done efficiently.

“Parallel” means many things on the GPU. Don’t worry about warps or blocks at first when trying to wrap your head around a GPU algorithm; just start with the threads. GPUs implement a data-parallel paradigm, which means you perform exactly the same set of instructions on a large number of data elements (tens of thousands to millions or more). You need to imagine that EVERY SINGLE thread is being calculated at exactly the same time. If you can cast your algorithm into this form, it should work very well on the GPU.

After you see this, all of the details of blocks and warps, the interleaved execution for latency hiding, etc… just become implementation details. They are sometimes important for performance reasons (especially memory access patterns), but the same basic picture remains: a massive number of independent, but identical, calculations are being performed, each on different data values.
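A trivial sketch of that mindset (my own toy example): one thread per element, every thread executing the identical instructions on its own piece of data.

[codebox]// Every thread applies the same operation to exactly one array element.
__global__ void scaleAndOffset(float* data, float scale, float offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * scale + offset;
}

// Host side: launch enough 256-thread blocks to cover all n elements, e.g.
// scaleAndOffset<<<(n + 255) / 256, 256>>>(d_data, 2.0f, 1.0f, n);[/codebox]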

What about passing class-type objects as parameters of a kernel function?

Is that possible in CUDA?

Maybe that’s what the topic starter can use.

I mean something like:

[codebox]class myclass {
public:
    // Must be a __device__ function to be callable from a kernel.
    __device__ void f() { /* … */ }
};

__global__ void mykernel(myclass* mc)
{
    mc->f();
}

int main(…)
{
    myclass mc;                                // note: “myclass mc();” would declare a function
    myclass* d_mc;
    cudaMalloc(&d_mc, sizeof(myclass));        // the kernel needs a device pointer,
    cudaMemcpy(d_mc, &mc, sizeof(myclass),     // not the address of a host object
               cudaMemcpyHostToDevice);

    mykernel<<<…>>>(d_mc);

    cudaFree(d_mc);
    return 0;
}[/codebox]

This is what I want to do… But you mean the “driver API”, not the runtime API above, right? I have only found driver API functions for dealing with modules and runtime generated code.

Does this mean that I have to convert all my runtime API code into driver API code in order to use the module functions, or is it safe to combine the two APIs in this case? (I know the manual says they shouldn’t be mixed.)

/Lars

A simple switch-case approach may not be useful in most cases.

To be able to do truly dynamic computations, the kernel may need changes (perhaps coming from the user).
As in DirectX and OpenGL, where shaders can be compiled and then run on the GPU, compute kernels could also be dynamically generated and compiled under CUDA.

OpenCL looks really promising to me in this regard. It supports dynamic kernel input and compilation/execution at runtime.
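Something like this (a rough sketch with placeholder names; it just picks the first GPU it finds, and error checking is omitted):

[codebox]#include <CL/cl.h>

int main(void)
{
    // The kernel source is an ordinary string, so it can be generated at runtime.
    const char* source =
        "__kernel void my_kernel(__global float* data) {"
        "    int i = get_global_id(0);"
        "    data[i] *= 2.0f;"
        "}";

    // Pick the first platform/GPU and create a context.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    // Compile the source for this device and look up the kernel by name.
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "my_kernel", NULL);

    /* … set arguments with clSetKernelArg and enqueue with
       clEnqueueNDRangeKernel as usual … */

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseContext(context);
    return 0;
}[/codebox]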

Hi! I’m writing an AI program (it’s my diploma work) that generates and runs many GPGPU programs at runtime. The generated AI code depends on the user’s parameters, and those parameters are pieces of C++ or C code. After that code is written, it has to be compiled with the CUDA compiler and run on the GPU. I currently use OpenCL (because of its online compilation), but AMD video cards are very slow with it (the new ones too! Are they just too lazy, or simply unable, to speed up their OpenCL code?), so I’m using an NVIDIA card, and it is not necessary for my code to run on other platforms (online compilation, the first OpenCL instruction, and double-precision floating-point code are all very fast on GeForce cards). CUDA would be the best for me, but without online compilation I can’t use it for my project. Is there any possibility of online compilation with CUDA? Or will there at some point be any functions, DLLs, or libs to do CUDA online compilation? Sorry for my English.