Shader-style 'uniform' bools? Does Cuda expand kernels with constant booleans into unique kernels?

D3D9 shaders can have ‘uniform bool’ inputs, the resulting code paths of which are expanded into unique shaders behind-the-scenes by the FX framework. This allows you to write a single shader with conditional paths, but which runs as optimal shaders without flow control.

(it’s the equivalent of instantiating an inlined function with differing const bool arguments).

i.e. if I supply a ‘const bool’ as a kernel argument, will the runtime likewise split the code into optimised separate kernels?

If you put the test for the bool at the very start of your kernel, then yes… the runtime will use just a couple of clocks to make the test once, but after that all your warps will run divergence-free.

If your code uses those bool arguments repetitively and deeply inside your kernel, you can optimize it by having the compiler create two versions for you. Then you can code with the boolean any way you like and the compiler will collapse all your tests for you, and you can switch at runtime.

template <bool myAlt>
__device__ void myRealKernel()
{
    ... code here, using myAlt any way you like ...
}

__global__ void mykernel(bool myAlt)
{
    if (myAlt) myRealKernel<true>();
    else       myRealKernel<false>();
}
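A complete version of that pattern might look like the sketch below (the kernel names and the doubling example are illustrative, not from the post). You can also template the __global__ function itself and dispatch on the host, which is the most direct analogue of selecting a D3D9 technique:

```
// Templated kernel: the compiler emits one fully specialized kernel
// per bool value, with no runtime branch left inside the device code.
template <bool doubleIt>
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (doubleIt)          // compile-time constant: folded away
            data[i] *= 2.0f;
        else
            data[i] *= 0.5f;
    }
}

// Host-side dispatch: pick the specialization at launch time,
// much like binding a different D3D9 technique per pass.
void launchScale(float *d_data, int n, bool doubleIt)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    if (doubleIt) scaleKernel<true ><<<blocks, threads>>>(d_data, n);
    else          scaleKernel<false><<<blocks, threads>>>(d_data, n);
}
```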

How exactly does that work internally? And is the restriction that it’s literally the first test, and only one nest deep? And does it strictly require a matching ‘else’?

Great, I wasn’t aware Cuda supported templates (I’ve only experimented with it briefly so far). That’s the direct equivalent to D3D9 techniques.

Yes, templates will do exactly what you need. (‘constant bool’ won’t, and ‘const bool’ only sometimes.)

I should point out, though, that unlike many cards of the D3D9 generation, CUDA has no limitations on flow control and only a very small performance loss when it is used uniformly (on the order of a few cycles per if statement).

It will work anywhere, as deep as you like, and inside any loops you like. But each time the runtime evaluates it, it will take a few cycles. Your question is about eliminating that repeated test overhead… so moving the test up helps make sure that it’s only done once. But you can trade off where the test sits against how many times it runs: code clarity for efficiency, for example.
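To make that trade-off concrete, here is a hypothetical sketch (names are mine, not from the thread): testing the flag inside a loop pays the few cycles on every iteration, while hoisting it out pays them once, at the cost of duplicating the loop:

```
// Flag tested on every iteration: a few cycles each time around.
__device__ float accumulate(const float *v, int n, bool useAbs)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += useAbs ? fabsf(v[i]) : v[i];
    return sum;
}

// Flag hoisted out of the loop: tested once, then two tight loops
// with no per-iteration branch at all.
__device__ float accumulateHoisted(const float *v, int n, bool useAbs)
{
    float sum = 0.0f;
    if (useAbs) { for (int i = 0; i < n; ++i) sum += fabsf(v[i]); }
    else        { for (int i = 0; i < n; ++i) sum += v[i];        }
    return sum;
}
```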

CUDA doesn’t support full C++, in large part because of the restrictions of memory address spaces, no taking addresses of registers, no dynamic memory allocation at runtime, etc. But if you keep your code simple, sort of “C with classes and templates”, and avoid virtual inheritance, it works pretty well. Templates are especially useful and seem to work 100%, even template metaprogramming, which I’ve found useful for my GPU PRNG library.
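As one example of the kind of template metaprogramming that compiles fine in device code (a generic sketch, not the poster’s PRNG code): compile-time loop unrolling via a recursive template, which the compiler collapses to straight-line code:

```
// Recursive template: unrolls a fixed-size accumulation at compile time.
template <int N>
struct Unroll {
    __device__ static float sum(const float *v)
    {
        return v[N - 1] + Unroll<N - 1>::sum(v);
    }
};

template <>
struct Unroll<0> {                 // base case terminates the recursion
    __device__ static float sum(const float *) { return 0.0f; }
};

__global__ void sum8(const float *in, float *out)
{
    // The recursion collapses to eight straight-line adds:
    // no loop, no branch, all resolved by the compiler.
    *out = Unroll<8>::sum(in);
}
```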

What I meant was, what is the runtime actually doing behind-the-scenes? Are we talking about kernel-execution-level predication/dynamic flow control, or is the runtime actually evaluating the kernel before executing it on the device, and creating & uploading unique kernels for each code path? Where are the cycles spent?

It sounds like this is the equivalent of ‘static flow control’ in shaders, where you can set constant bool registers to toggle blocks of code (I was asking for templates, but I actually use both approaches in my D3D work because it keeps the number of shader permutations manageable).

So that is the equivalent of static shader branching, right?

Actually, do you guys know of an introduction to Cuda that assumes a GFX API background? I need it less now, but a translation of concepts would have helped me make sense of it much quicker (and I’m sure it’s the same for other GFX programmers).

Dynamic branching is really efficient in the hardware, but I can think of one class of cases where it doesn’t fully map to static branching the way the template trick does. You said that the static shader branching results in the shader compiler re-building and re-optimizing the shader code depending on the bool, right?

In CUDA, if you wrote a kernel like this:

__global__ void branchy(bool my_bool, float *a, const float *b, int i)
{
    if (my_bool) {
        // ... do some really simple task, like a[i] = b[i] * 5.0f ...
    } else {
        // ... do some really complicated task, like multiply two
        //     matrices together ...
    }
}
Let’s say that, compiled separately, the simple task uses only 5 registers, and the complicated task uses 25 registers and 4k of shared memory. You should get this same “separate” behavior in the template-trick statically branched kernel (note: I have not tried it; I’m not certain that the compiler is smart enough to optimize away unused shared memory declarations.)

The dynamically branched kernel will most likely end up using the greater of the two resource requirements. So running the simple kernel in the dynamically branched kernel will result in lower occupancy (and thus potentially lower performance) than the static one.

Anyway, this is all supposition and will probably end up being less than a ~10% performance difference even in the worst possible cases. But you did ask if it was equivalent, and technically it is not. Although for all practical purposes (especially where the dynamically branched paths are of similar complexity) it basically is equivalent.
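One way to check this supposition yourself is the real `cudaFuncGetAttributes` API: compare the resources of the two template instantiations against the runtime-branched kernel. The kernel names and empty bodies below are placeholders for the simple/complicated tasks above:

```
#include <cstdio>
#include <cuda_runtime.h>

template <bool complicated>
__global__ void workT() { /* simple or complex body picked by template */ }

__global__ void workDyn(bool complicated) { /* runtime if/else over both */ }

void reportResources()
{
    cudaFuncAttributes a;

    cudaFuncGetAttributes(&a, (const void *)workT<false>);
    printf("simple path:  %d regs, %zu B smem\n", a.numRegs, a.sharedSizeBytes);

    cudaFuncGetAttributes(&a, (const void *)workT<true>);
    printf("complex path: %d regs, %zu B smem\n", a.numRegs, a.sharedSizeBytes);

    // Expect roughly the max of the two figures above.
    cudaFuncGetAttributes(&a, (const void *)workDyn);
    printf("dynamic:      %d regs, %zu B smem\n", a.numRegs, a.sharedSizeBytes);
}
```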

In D3D there are actually 3 ways to branch:

  1. Generating unique shaders at compile time (using ‘techniques’ & ‘uniform’ params).

  2. ‘Static’ flow control, which is actually a run-time skipping of code-blocks, based on a constant boolean hardware register.

(as the flow is identical for the entire pass, this is said to be fairly efficient, but may hinder instruction optimisation).

  3. ‘Dynamic’ flow control. This is the result of a run-time evaluation, so it varies unpredictably and can help or hurt performance (if large pixel clusters take the same path it’s usually a win).

Cuda equivalents:

1 = templates

2 = … using const variables as kernel arguments?

3 = same, with similar performance implications (as the docs explain).

  1. Yup, templates. Also, stuff like if(true) or if(2>1) or if(const bool). The compiler can (usually?) perform obvious optimizations. This is all at compile time, and works like in C++.

  2. No such thing. Does not exist in DX10 cards.

  3. Ordinary if statements. Note: the “large pixel clusters” that do well with dynamic flow control are actually better defined in CUDA. The pixel cluster is called a warp, made of 32 threads (pixels), and you know exactly which threads are part of it (the first warp starts in the upper-left corner of the block and goes right, eventually wrapping to the next line. It’s “1D”, not a 2D cluster.). If all threads in the warp take the same path, then there’s only a couple of cycles of performance loss. If the warp diverges, still no big deal, but you have to wait for both sides to be calculated, like in predication.

Predication, btw, is also supported, but not exposed at the C level. The compiler uses a rule something like: “all if blocks shorter than 7 instructions are rendered as predication; all larger blocks use true flow control.”
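So a short conditional like the sketch below would typically compile to a predicated store rather than an actual branch (the 7-instruction figure is the rule of thumb above, not a documented guarantee, and the kernel is an illustrative example of mine):

```
__global__ void clampNegatives(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] < 0.0f)   // tiny body: likely emitted as a
        data[i] = 0.0f;            // predicated store, no jump at all
}
```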