Is there any way to construct a kernel at runtime, given a large number of static branches like so:
if (transformation[0] != 0.0)
if (transformation[1] != 0.0)
...
if (transformation[n] != 0.0)
so that the branching is precalculated?
transformation is located in constant memory, and is only expected to change (with respect to zeroed components) a few times per minute. Also, transformation is typically rather sparse, with about 95% 0.0’s at any time.
In fact, generally only one or two of the transformations are non-zero for any given kernel.
The problem is that the transformations are applied in the inner loop, and with 50 or so transformations, branch computation becomes a significant bottleneck.
Basically, I want to pick and choose which code segments to compute on a per-kernel-launch basis.
I think it is just a matter of trying a kernel like this. I don't think you will see a lot of overhead, since reads of transformation from constant memory are fast and there is no divergence within a warp. If you do not need the value of transformation, I would make it a boolean, though.
I have a kernel a bit like this that calculates 7 different averages in 1 kernel, where my grid is Nx7 big, so each block calculates a different average. That worked quite well with very little overhead.
If you find that there is a lot of overhead, you might be able to get this working with a template with a lot of parameters, but that will give you a lot of code, I am afraid (though an optimal kernel).
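To make the template idea concrete, here is a minimal sketch with only two transformations instead of 50. The names `transformKernel` and `launchTransform`, and the scale/offset operations, are made up for illustration and are not from the original code. Each flag is a compile-time constant, so every instantiation contains only the code for its enabled transformations. The number of instantiations doubles with each flag, which is the "lot of code" mentioned above.

```cuda
#include <cuda_runtime.h>

// Hypothetical setup: two transformations (a scale and an offset).
__constant__ float transformation[2];

template <bool UseScale, bool UseOffset>
__global__ void transformKernel(float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float v = data[idx];
    // The template arguments are compile-time constants, so the compiler
    // removes the disabled branches from each instantiation entirely.
    if (UseScale)  v *= transformation[0];
    if (UseOffset) v += transformation[1];
    data[idx] = v;
}

// Host-side dispatch: the zero test runs once per launch, not per thread.
void launchTransform(float* d_data, int n, const float h_t[2])
{
    cudaMemcpyToSymbol(transformation, h_t, 2 * sizeof(float));

    const bool useScale  = h_t[0] != 0.0f;
    const bool useOffset = h_t[1] != 0.0f;
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    if (useScale && useOffset)
        transformKernel<true, true><<<blocks, threads>>>(d_data, n);
    else if (useScale)
        transformKernel<true, false><<<blocks, threads>>>(d_data, n);
    else if (useOffset)
        transformKernel<false, true><<<blocks, threads>>>(d_data, n);
    else
        transformKernel<false, false><<<blocks, threads>>>(d_data, n);
}
```

Since only one or two transformations are usually active, a variant templated on the index of the active transformation might need far fewer instantiations than one flag per transformation.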
Use a bitmask and try to skip in packets, for example:
// 64 bits, enough for one flag per transformation
__constant__ unsigned long long transformations;
__global__ void kernel()
{
    for (int i …)
    {
        if (transformations & 0xFF)
            // check transformations 0 to 7
        if (transformations & 0xFF00)
            // check transformations 8 to 15
    }
}
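A fuller sketch of the packet-skipping idea might look like the following. It assumes 50 transformations as in the question. The names `activeMask`, `applyTransformation`, `maskedKernel` and `uploadTransformations` are hypothetical, and the per-transformation work (a simple add) is only a placeholder for the real code.

```cuda
#include <cuda_runtime.h>

#define NUM_TRANSFORMS 50   // count taken from the question

__constant__ float transformation[NUM_TRANSFORMS];
// Bit i is set when transformation[i] != 0.0 (built on the host).
__constant__ unsigned long long activeMask;

// Placeholder for the real per-transformation work.
__device__ float applyTransformation(int i, float v)
{
    return v + transformation[i];
}

__global__ void maskedKernel(float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float v = data[idx];
    // Examine 8 flags at a time; an all-zero packet costs a single test.
    for (int base = 0; base < NUM_TRANSFORMS; base += 8) {
        unsigned int packet = (unsigned int)((activeMask >> base) & 0xFFull);
        if (packet == 0) continue;
        for (int bit = 0; bit < 8; ++bit)
            if (packet & (1u << bit))
                v = applyTransformation(base + bit, v);
    }
    data[idx] = v;
}

// Host side: rebuild the mask whenever the transformations change.
void uploadTransformations(const float* h_t)
{
    unsigned long long mask = 0;
    for (int i = 0; i < NUM_TRANSFORMS; ++i)
        if (h_t[i] != 0.0f) mask |= 1ull << i;

    cudaMemcpyToSymbol(transformation, h_t, NUM_TRANSFORMS * sizeof(float));
    cudaMemcpyToSymbol(activeMask, &mask, sizeof(mask));
}
```

With about 95% zeros, most packets are skipped after one comparison, and because every thread reads the same mask there is no divergence within a warp.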
Pixel/vertex shaders (probably) get compiled (and cached) by the driver when you change static branches.
For CUDA, if you need real static branches, you have to do this on your own (keep multiple versions of the kernel already stored, or invoke nvcc and use the driver API to load the result).
I don't think the hardware has anything like ‘static branches’ anyway.
for (int i = 0; i < 50; ++i)
    if (transformation[i] != 0.0)
        for (int j = 0; j < 10000; ++j)
What should happen, if the compiler performs the unroll as it’s told, is that it will do the dirty work of duplicating your code then optimize the inner switch() to run the appropriate block of code for the current i.
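The unroll-plus-switch approach described here could be sketched as follows. The snippet above does not show the pragma or the switch body, so both are filled in as assumptions: `#pragma unroll` on the outer loop, a `switch (i)` inside the inner loop, four transformations instead of 50, and made-up arithmetic in each case. The kernel name `unrolledKernel` and the parameter `innerIters` are hypothetical as well.

```cuda
#include <cuda_runtime.h>

#define NUM_TRANSFORMS 4   // kept small for illustration; the question has ~50

__constant__ float transformation[NUM_TRANSFORMS];

__global__ void unrolledKernel(float* data, int n, int innerIters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float v = data[idx];

    // Ask the compiler to duplicate the loop body once per value of i.
    #pragma unroll
    for (int i = 0; i < NUM_TRANSFORMS; ++i) {
        // The zero test now runs once per transformation,
        // not once per inner-loop iteration.
        if (transformation[i] != 0.0f) {
            for (int j = 0; j < innerIters; ++j) {
                // In each unrolled copy i is a constant, so the compiler
                // can reduce this switch to the single matching case.
                switch (i) {
                case 0:  v += transformation[0]; break;
                case 1:  v *= transformation[1]; break;
                case 2:  v -= transformation[2]; break;
                default: v = fmaxf(v, transformation[i]); break;
                }
            }
        }
    }
    data[idx] = v;
}
```

Whether the compiler actually unrolls and folds the switch is worth checking in the generated PTX, since the pragma is a request rather than a guarantee.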
CUDA doesn’t have any on-the-fly optimization, self-modifying code, etc. What’s interesting, though, is that since OpenCL implementations are being built on LLVM, that platform might support this kind of runtime re-optimization.
There's a company called “sci finance” that generates CUDA code automatically according to your inputs.
Seen that way, “dynamic code” generation is not a bad idea, if you are sure that you are going to get the performance.
Just write the kernel to a file at run time, compile it into a CUBIN, and figure out a way of launching a kernel from the “cubin” file. I think it is possible. Some expert in this forum should be able to show the way.
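One possible shape for this, as a rough sketch rather than a tested recipe: generate source containing only the active transformations, compile it by calling nvcc, and load the result through the driver API (`cuModuleLoad`, `cuModuleGetFunction`, `cuLaunchKernel`). The sketch compiles to PTX instead of a cubin so that no target architecture has to be hard-coded. The file names, the kernel name `generated`, the helper names, and the placeholder work (`v += t[i]`) are all assumptions, and the transformation values are passed as an ordinary device pointer here rather than through constant memory. It also assumes nvcc is on the PATH and that a CUDA context is already current.

```cuda
#include <cuda.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Emit a kernel that contains code only for the active transformations.
std::string generateKernelSource(const std::vector<int>& active)
{
    std::ostringstream src;
    src << "extern \"C\" __global__ void generated(float* data, const float* t, int n)\n"
        << "{\n"
        << "    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n"
        << "    if (idx >= n) return;\n"
        << "    float v = data[idx];\n";
    for (int i : active)
        src << "    v += t[" << i << "];\n";   // placeholder per-transformation work
    src << "    data[idx] = v;\n"
        << "}\n";
    return src.str();
}

// Write the source, compile it with nvcc, and load the resulting module.
// Requires a current context (for example cuInit, cuDevicePrimaryCtxRetain
// and cuCtxSetCurrent called beforehand). Returns false on any failure.
bool buildAndLoad(const std::vector<int>& active, CUmodule* module, CUfunction* kernel)
{
    {
        std::ofstream out("generated.cu");
        out << generateKernelSource(active);
    }   // file is flushed and closed here
    if (std::system("nvcc -ptx -o generated.ptx generated.cu") != 0)
        return false;
    if (cuModuleLoad(module, "generated.ptx") != CUDA_SUCCESS)
        return false;
    return cuModuleGetFunction(kernel, *module, "generated") == CUDA_SUCCESS;
}

// Launch the loaded kernel on n elements.
CUresult launchGenerated(CUfunction kernel, CUdeviceptr d_data, CUdeviceptr d_t, int n)
{
    void* args[] = { &d_data, &d_t, &n };
    const unsigned int threads = 256;
    const unsigned int blocks  = (n + threads - 1) / threads;
    return cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1,
                          0, nullptr, args, nullptr);
}
```

Since the set of zeroed components only changes a few times per minute, the cost of invoking nvcc could be paid once per change, and the loaded modules could be cached by their set of active indices. The host program has to be linked against the driver library (`-lcuda`).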