CUDA Recursive Call

Hi Everyone,
I am trying to call my kernel functions recursively, but I am getting CUDA error 719 (cudaErrorLaunchFailure), even when my program makes only one recursive call. When I instead add new functions with the same content and call them inside each other, effectively making four nested calls, everything works fine. What is the problem?

Thanks in advance.
Canberk

Calling a kernel from a kernel is referred to as dynamic parallelism in CUDA. There are CUDA sample projects that demonstrate proper usage of dynamic parallelism, and there is a whole section of the programming guide dedicated to it.
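For reference, here is a minimal sketch of a device-side launch; the kernel names and launch configuration are illustrative only, and the code must be compiled with relocatable device code (e.g. nvcc -rdc=true -lcudadevrt):

```
#include <cstdio>

__global__ void child(int depth) {
    printf("child running at depth %d\n", depth);
}

// A parent kernel may launch other kernels from device code; this is
// what CUDA calls dynamic parallelism (requires -rdc=true at compile time).
__global__ void parent() {
    child<<<1, 1>>>(1);   // device-side kernel launch
}

int main() {
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();   // host waits for parent and its children
    return 0;
}
```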

However, I’m not able to sort out the problem based on that description alone. My usual suggestion in these cases is to provide a short, complete example that demonstrates the problem. The most important word there is complete. I should be able to copy, paste, compile and run, and see the issue, without having to add anything or change anything. Do as you wish, of course.

On the occasions when I need recursive calls, I use the kernel function (the __global__ one) as a wrapper around a device function that implements the recursion, as sketched below. However, the parallelism I can achieve this way is very limited, as I quickly encounter the error “too many resources requested for launch” when I increase the number of threads. The number of registers required for each thread seems to increase very steeply with the recursion level.
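For illustration, a minimal sketch of that wrapper pattern (factorial is just a stand-in for whatever recursion is actually needed):

```
#include <cstdio>

// Recursive device function: the recursion lives here, not in the kernel.
__device__ long long factorial(int n) {
    if (n <= 1) return 1;            // base case
    return n * factorial(n - 1);     // each level consumes stack and registers
}

// The __global__ wrapper just dispatches into the recursive device function.
__global__ void factorialKernel(const int* in, long long* out, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) out[i] = factorial(in[i]);
}
```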
Also, I have often encountered invalid memory references, which I could solve by increasing the stack limit per thread (see the sketch below). I guess that each device call comes with a significant overhead and that calls cannot be nested as deeply as on a CPU.
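The stack limit can be raised with cudaDeviceSetLimit; a minimal sketch (the 16 KB value is arbitrary and should be sized to the worst-case recursion depth):

```
#include <cstdio>

int main() {
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);   // default is typically 1 KB per thread
    printf("default stack size per thread: %zu bytes\n", stackSize);

    // Raise the per-thread stack to accommodate deeper recursion.
    cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("new stack size per thread: %zu bytes\n", stackSize);
    return 0;
}
```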

The sort of recursion you are describing is not based on CDP (CUDA Dynamic Parallelism).

I’m not sure why that would be. In the general case, the compiler has no knowledge of the recursion level (or depth) and therefore could not possibly be making register-usage decisions based on it.

Correct, increasing recursion depth will increase stack usage, which has to be accounted for.

Hi Robert,

I was imprecise in describing this observation. The issue happens at runtime: when I check cudaGetLastError after the kernel execution, I get “too many resources requested for launch”. Googling this error was not very successful; some posts I found talk about the GPU running out of SM registers, but I have not reached a conclusive answer yet. However, the error persists even if I set the recursion level to one, so the issue is not related to recursion. I will probably create a new post about it once I have reduced it to a small example.
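For what it’s worth, a minimal launch-checking sketch (the kernel and launch configuration are placeholders); note that “too many resources requested for launch” is reported by cudaGetLastError immediately after the launch, before any synchronization:

```
#include <cstdio>

__global__ void myKernel() { }   // placeholder for the real kernel

int main() {
    myKernel<<<1, 1024>>>();                         // hypothetical launch configuration
    cudaError_t launchErr = cudaGetLastError();      // catches launch-configuration errors,
                                                     // e.g. too many resources requested
    cudaError_t execErr = cudaDeviceSynchronize();   // catches errors raised during execution
    printf("launch: %s, execution: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(execErr));
    return 0;
}
```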

The most common reason for “too many resources requested for launch”, in my experience, is that your GPU kernel code uses enough registers per thread that, when multiplied by the number of threads per block you are requesting at launch, the total exceeds the number of registers available on the GPU SM. There are other possibilities, though.

To assess whether my “most likely guess” applies, compile the file that contains the kernel in question with -Xptxas -v on the nvcc compile command line. Assuming the kernel code is all in a single file, that will report how many registers per thread are used by each kernel in that file. Multiply that number by the number of threads per block you are requesting at kernel launch time, and compare the product to the hardware technical specifications in the programming guide. As you can see there, all “current” GPUs offer 65536 registers per SM. Divide that number by the registers per thread reported by nvcc -Xptxas -v ... and you get the maximum number of threads that can be launched for that kernel, subject to granularity considerations. Any attempt to launch more threads than that will result in this error.
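If you would rather inspect register usage at runtime instead of (or in addition to) reading the ptxas output, cudaFuncGetAttributes reports it per kernel. A minimal sketch, with a hypothetical placeholder kernel:

```
#include <cstdio>

__global__ void myKernel() { }   // placeholder for the kernel under test

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    // 65536 registers per SM on current GPUs (see the technical specifications
    // table in the programming guide); dividing by per-thread usage gives an
    // upper bound on threads per block for this kernel.
    const int regsPerSM = 65536;
    printf("registers per thread: %d\n", attr.numRegs);
    if (attr.numRegs > 0)
        printf("max threads by register budget: %d\n", regsPerSM / attr.numRegs);
    printf("maxThreadsPerBlock reported for this kernel: %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```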

This general methodology, along with suggestions for remedies, is covered in various forum posts, such as here and here.