Dynamic "ubershader" style kernels - static branching

Keldor314 · November 2, 2008, 5:33am

Is there any way to construct a kernel at runtime, given a large number of static branches like so:

if (transformation[0] != 0.0)
…
if (transformation[1] != 0.0)
…
…
if (transformation[n] != 0.0)
…

so that the branching is precalculated?

transformation is located in constant memory, and is only expected to change (with respect to zeroed components) a few times per minute. Also, transformation is typically rather sparse, with about 95% 0.0’s at any time.

In fact, generally only one or two of the transformations are non-zero for any given kernel.

The problem is that the transformations are applied in the inner loop, and with 50 or so transformations, branch computation becomes a significant bottleneck.

Basically I want to pick and choose which code segments to compute on a per-kernel-launch basis.

E.D_Riedijk · November 2, 2008, 7:34am

I think it is a matter of just trying a kernel like this. I think that you will not see a lot of overhead, since reads of transformation will be fast from constant memory (every thread reads the same element, so the value is broadcast), and there is no divergence within a warp. If you do not need the value of transformation, I would make it a boolean though.

I have a kernel a bit like this that calculates 7 different averages in 1 kernel, where my grid is Nx7 big, so each block calculates a different average. That worked quite well with very little overhead.

If you find that there is a lot of overhead, you might be able to get this working with a template with a lot of parameters, but that will give you a lot of code, I am afraid (though an optimal kernel).
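A minimal sketch of the template idea, written as plain host-side C++ so it can be compiled without a GPU. The names transform0, transform1, apply_all and launch are hypothetical and only two flags are shown; in CUDA, apply_all would be a __global__ kernel taking the same template parameters.

```cpp
#include <cstddef>

// Hypothetical stand-ins for two of the ~50 transformations.
inline float transform0(float x, float w) { return x + w; }
inline float transform1(float x, float w) { return x * w; }

// USE0/USE1 are compile-time constants, so a disabled branch can be
// dropped entirely from each instantiation of the function.
template <bool USE0, bool USE1>
void apply_all(float* data, std::size_t n, const float* transformation) {
    for (std::size_t i = 0; i < n; ++i) {
        float x = data[i];
        if (USE0) x = transform0(x, transformation[0]);
        if (USE1) x = transform1(x, transformation[1]);
        data[i] = x;
    }
}

// Host-side dispatch: test the weights once per launch and pick the
// matching instantiation, so the inner loop carries no runtime tests.
void launch(float* data, std::size_t n, const float* transformation) {
    const bool u0 = transformation[0] != 0.0f;
    const bool u1 = transformation[1] != 0.0f;
    if (u0 && u1)  apply_all<true, true>(data, n, transformation);
    else if (u0)   apply_all<true, false>(data, n, transformation);
    else if (u1)   apply_all<false, true>(data, n, transformation);
    else           apply_all<false, false>(data, n, transformation);
}
```

The "lot of code" warning shows up in the dispatch: every extra boolean parameter doubles the number of instantiations, so a full cross-product over 50 flags is not practical. Since only one or two transformations are usually active, templating on the indices of the active ones may be a more workable variant.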

MxAddict · November 2, 2008, 2:01pm

Use a bitmask and try to skip in packets, for example:

__constant__ unsigned long long transformations;  // one bit per transformation; 64 bits covers all 50

__global__ void kernel()
{
    for (int i …)
    {
        if (transformations & 0xFF)
        {
            // check transformations 0 to 7
        }
        if (transformations & 0xFF00)
        {
            // check transformations 8 to 15
        }
    }
}
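A rough host-side sketch of how such a mask might be built and walked in packets of eight. build_mask and active_indices are hypothetical helper names, and the sketch assumes at most 64 transformations stored as floats.

```cpp
#include <cstddef>
#include <cstdint>

// Pack "transformation[i] != 0" into bit i of a 64-bit mask.
// Assumes at most 64 transformations; extra entries are ignored.
std::uint64_t build_mask(const float* transformation, std::size_t count) {
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < count && i < 64; ++i)
        if (transformation[i] != 0.0f) mask |= (std::uint64_t{1} << i);
    return mask;
}

// Collect the indices of the set bits, skipping any all-zero group of
// eight with a single test. `out` must have room for 64 entries.
int active_indices(std::uint64_t mask, int* out) {
    int n = 0;
    for (int base = 0; base < 64; base += 8) {
        if (((mask >> base) & 0xFFu) == 0) continue;  // whole packet empty
        for (int bit = base; bit < base + 8; ++bit)
            if ((mask >> bit) & 1u) out[n++] = bit;
    }
    return n;
}
```

With about 95% of the entries zero, most packets fail the first test, so the per-bit checks run for only one or two groups of eight.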

Pixel/vertex shaders probably get recompiled (and cached) by the driver when you change static branches.
For CUDA, if you need real static branches, you have to do this on your own: either have multiple versions of the kernel already stored, or invoke nvcc and use the driver API to load the result.
I don't think the hardware has anything like ‘static branches’ anyway.

alex_dubinsky · November 3, 2008, 2:06am

Try something like:

#pragma unroll
for (int i = 0; i < 50; ++i)
    if (transformation[i] != 0.0)
    {
        ...your code...
        for (int j = 0; j < 10000; ++j)
        {
            switch (i) {
                case 0:
                    ...
                case 1:
                    ...
                case 2:
                    ...
            }
        }
        ...your code...
    }

What should happen, if the compiler performs the unroll as it’s told, is that it will do the dirty work of duplicating your code and then reduce the inner switch() to the appropriate block of code for the current i, since i is a compile-time constant in each unrolled copy.

CUDA doesn’t have any on-the-fly optimization/self-modifying code/etc. But what’s interesting is that since OpenCL is being based on LLVM, this platform might support this kind of runtime re-optimization.

Sarnath · November 3, 2008, 9:30am

It's a good thought, Keldor.

There's a company called “sci finance” that generates CUDA code automatically according to your inputs.

So “dynamic code” generation is not a bad idea, if you are sure that you are going to get the performance.

Just write the source to a file at run time, compile it into a CUBIN, and figure out a way of launching a kernel from the “cubin” file. I think it is possible. Some expert in this forum should be able to show the way.
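As a rough illustration of the "write the source at run time" step only, here is a host-side sketch that emits kernel text containing just the active branches. The kernel name specialized, the apply_N device functions and the transformation constant array are all hypothetical and assumed to be declared in a prelude that is not shown.

```cpp
#include <cstddef>
#include <string>

// Build CUDA source for a kernel that contains only the active branches.
// The apply_<k> device functions and the `transformation` constant array
// named in the emitted text are assumptions; they are not defined here.
std::string generate_kernel_source(const float* transformation,
                                   std::size_t count) {
    std::string src =
        "extern \"C\" __global__ void specialized(float* data, int n)\n"
        "{\n"
        "    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (idx >= n) return;\n"
        "    float x = data[idx];\n";
    for (std::size_t k = 0; k < count; ++k) {
        if (transformation[k] != 0.0f) {
            // Emit one unconditional call per non-zero entry.
            src += "    x = apply_" + std::to_string(k) +
                   "(x, transformation[" + std::to_string(k) + "]);\n";
        }
    }
    src += "    data[idx] = x;\n}\n";
    return src;
}
```

The resulting string would still have to be written to a file, compiled (for example with nvcc -cubin), and loaded through the driver API with calls such as cuModuleLoad and cuModuleGetFunction. Those steps are left out here, and whether the compile time pays off depends on how rarely the set of active transformations changes.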
