Why does vkCreateComputePipelines take so long for a recursion-like compute shader?

I tried this experiment on an NVIDIA GTX 560 and on an NVIDIA GTX 550 Ti.

I took the ray tracing example from Sascha Willems.

I replaced the shader code with a path tracing shader.
I reused the CPU C version from:

I did 2 tests:

  • A pure recursive version where the function calls itself
  • A version that copy-pastes the radiance() function 8 times, so that it is not purely recursive: each copy of the function calls the next copy in the chain.
    With 4 copies it works without any problem, so the implementation itself is correct. However, if I add more copies of the function, I get problems.

On Ubuntu 16 LTS, both versions eat up all the memory and freeze my mouse, so I have to restart the computer.
On Debian stable 8.6, it doesn’t freeze: it uses the swap and keeps a nice 10 % of memory free, but it takes far too long.

Comment 1:
It seems to me that the drivers are not really ready to handle the recursion that the SPIR-V specification clearly allows for OpFunction. It is as if this feature were not supported. Since OpenGL does not allow it, it is something new.

–> Does the NVIDIA driver support recursion of OpFunction, both pure and via copy-pasted chains?
I installed the correct NVIDIA proprietary driver for Debian as explained on the official Debian page.

–> Why does Ubuntu not use the swap efficiently when running Vulkan? It seems to me that the memory is grabbed so quickly that the OS loses control.

–> Why does the function vkCreateComputePipelines() take so long, and why does it use so much RAM?

–> Is there anything I can do to make this path tracing work?

code available here:

Pure recursion is definitely not allowed in Vulkan. While SPIR-V allows it, the Vulkan environment does not. See Vulkan Spec Appendix A.3: “Recursion: The static function-call graph for an entry point must not contain cycles”.

As for the hardware itself, I am pretty sure no current GPU has the concept of a call stack; instead, all function calls are fully inlined. In your copy-paste variant, each function has 7 places where it calls the next function, terminated at the 9th layer. If all of them are inlined, you end up with 7^8 = 5,764,801 copies of the function. If you terminate at the 4th layer, you have only 7^3 = 343 copies. So for 4 copies to work but for 9 to run out of memory seems reasonable. Even if you were able to compile this with enough RAM, trying to run that monster on the GPU might still fail due to the shader executable size.

More or less the best way to deal with this is to transform the code manually, with an explicit, hand-coded stack. It seems like all recursive calls are of the form return a + b * recursion(), with a and b being vec3. With two levels that gives return a0 + b0 * (a1 + b1 * rec()), which you can always split into a constant part (a0 + b0 * a1) and a factor (b0 * b1) for the next recursion. So if you instead write a loop, factor these two out, and update them every iteration, you can transform this into a tail recursion plus two vec3 of state, which means you can have arbitrarily deep recursion with a constant-sized, fixed stack. Obviously you still want an upper loop limit so you don't get infinite render times.

Hope this helps.



Will the Vulkan environment ever allow recursive functions?
How can it work on the GPU without a call stack?

Why can a default Ubuntu install not swap correctly for this experiment, while Debian can?

Well, there is no stack on the GPU. Every variable is either stored in a register from the large register file (my GTX 970 has 65,536 32-bit registers in each of its 13 SMMs) or is spilled to shared memory (on my card, 96 KB per SMM). Since each SMM runs hundreds of threads at the same time (again, on my card that would be between 256 and 2048 threads at 100% occupancy), there actually is not that much memory per thread. If you take, for example, 512 running threads, each thread gets 128 32-bit registers and 192 bytes (up to 48 × 32-bit words) of shared memory. Together that’s less than 1 KB of total memory, simply not enough space for a general stack. Also, registers are “fixed”: if you have an array that you index dynamically, or generally anything that might act like a pointer (i.e. a stack pointer), you NEED to put it into shared memory.

And just a stack isn’t enough: while a function call is a jump to a fixed instruction address, a return is a jump to a dynamic instruction address loaded from a register or from memory. Since the GPU executes threads in lockstep (32 threads at a time on NVIDIA), all threads share the same instruction pointer, but there is a bitmask that can “disable” threads, for example based on an if statement. So if 30 threads enter the if-block and 2 threads enter the else-block, all 32 threads execute BOTH blocks, but results/side effects are discarded based on the bitmask. With this hardware, having different threads return to different parent functions is impossible. All threads would need to share the exact same recursion call tree, so all threads execute the MAXIMUM amount of recursion among the 32 threads.

No matter how you look at it, recursive functions are a misfit for GPUs. I doubt Vulkan will ever support them. Transforming your code AND your data structures to fit the GPU is the right way. That also applies to the CPU, by the way: while recursion is supported, it's not really fast on modern CPUs either.


Although what you say is true for recursion in general, there is an exception.

Certain recursive algorithms can be optimized to not overload the stack.

By the way, I disliked the NVIDIA screen added to my boot when installing their proprietary driver.

Yes, you can optimize tail recursion. But you can easily transform tail recursion into a while loop.


ret func(...params...)
{
    ... some code with all recursive returns in the form
        "return func(...new params...)" ...
}

becomes:

ret func(...params...)
{
    while (true) {
        ... same code as above, but all recursive returns are
            transformed to "{ params = new params; continue; }" ...
    }
}

The non-recursive returns stay the same. This is basically the optimization the compiler does, and there is no reason the front-end compiler that generates your SPIR-V cannot do it too. One of the ideas behind SPIR-V + Vulkan is that the compiler in the driver can be smaller, pushing as much work as possible to the front-end compiler. So this may be something to add to the GLSL compiler or the SPIR-V optimizer, but not something the driver should do.