CUDA recursion failed

Hello, everyone. Please help.

I am creating a ray tracing application that require recursions. However, the first recursion worked, but the second time doesn’t. Below is the structure of my functions

device void BVHIntersection(params){
if(condition){
//some code
}else{
BVHIntersection(params_1);
BVHIntersection(params_2);
}
}

device float intersection(params, rec){
//some code
BVHIntersection(params);
//some other code
if(reflective and rec < 1) //rec is the stopping condition
float *la = intersection(params_1, rec + 1);
}

The first calling of intersection:
rec = 0
recursion on BVHIntersection called nicely

The second one (in the reflective part)
call intersection again with rec = 1 → called
recursion on BVHIntersection cannot be called. //this is the problem

any help will be highly appreciated.
Thanks in advance

P.S. I cannot start my CUDA debugging. I have given the free(var) command for every malloc

GPU threads have relatively small stacks so my first advice would be to avoid recursion in device functions unless you can guarantee that the recursive depth is capped at something small enough to comfortably fit in the stack.

What do you mean by “recursion on BVHIntersection cannot be called”? I’m assuming this also stands for bounding volume hierarchy?

Hello, MutantJohn. Thanks for your reply!
So, is loop better than recursion?

Yes, it is Bounding Volume Hierarchy. I tried to printf something before recursion and it was executed. But, when I tried to print something after the recursion called, it was not executed. I thought that it was something about the function (in recursion) cannot be called

In general, traversing a bounding volume heirarchy is more efficient to query as a stack of nodes to process as opposed to the use of conceptually elegant recursion. They end up doing the same work in the same order, but the custom work queue is more efficient in memory (often it’s only storing set of pointers, not a whole reference frame of data), in computation (you’re only updating a pointer, not reloading a frame of data), and in versatility (you may find your goal intersection and you want to exit. With a stack you just release the stack and return, with recursion the state has to be popped up each level.) This is generally true on both CPU and GPU, and for both pointlike lookups (say for particle system searches) and raytracing (which need ordered intersections of a ray traversal). CPUs are far far more efficient at recursion than GPUs since they just shift reference frame data with a stack pointer update and let the L1 cache lazily hide most of the expense. The GPU needs to actually actively copy its data, which can benefit from caching but is still less efficient.

But this doesn’t get into the enormously important GPU complexity when you consider divergence and trying to share node access over (hopefully) similar thread queries. The majority of the fastest GPU raytracing tricks involve actively shuffling work between threads to make each warp’s threads ideally all access the same node/object/particles/rays at the same time. Clean, short, elegant, straightforward traversals like you see for CPU pseudocode in books tends to quickly turn into to fully divergent GPU threads, meaning your warps are computing at only 1/32 throughput.

In my own GPU raytracer, I have one traversal stack per WARP and multiple rays per thread to allow the threads to opportunistically find the “best ray” to apply to a retrieved node. It gets complex, which is why Optix is a very useful library where NVidia has done a spectacular job in making an efficient raytracing API. You’re getting pretty deep into very advanced architecture aware software design if you’re making your own GPU tracer from scratch, unlike a CPU raytracer design which is, in its simplest form, on the difficulty scale of an elegant homework assignment.

None of this is directly answering your question about recursion on GPUs, but MutantJohn nailed it with his summary of it being supported, but with a much more limited local stack, and to be avoided when possible for both depth and efficiency reasons. I’ll add to that that many recursive calls are actually tail recursion, solved more efficiently by a loop instead of a recursive evaluation, which is better on both CPU and GPU. This is common even in heirarchical node searches, especially point-based queries for particle systems. Even a forking recursion (like your query) can replace two recursive calls per level with one recursive call and a (free) tail recursion, which is still recursive, but uses less resources.

Hello, SPWorley. Thanks for your reply.
Can you please explain it in more “human” language? Because I’m really a newbie in CUDA and Ray Tracing. ._.

Anyway, changed my BVHIntersection into loop and it worked :)

I don’t want to curb your enthusiasm, but it seems to me that you might want to hone your CUDA skills on slightly less ambitious projects, before embarking on a very complex one, like a parallelized raytracer.

When I was a young lad, there were those who embarked on writing an operating system a week after making first contact with C, and that rarely worked out :-)

Hello, njuffa :)
The problem is, the ray tracer is my final bachelor project and I have defined it in my project environment. No way back ._.

My post was already a high level simplified summary of the common method of traversal algorithm data structure. As njuffa says, you will have a challenge to learn both CUDA and raytracing simultaneously, especially since raytracing is a real algorithmic challenge to do efficiently in SIMT architectures like CUDA.

If you are locked into the project of raytracing on GPUs, you might considerably simplify your work by implementing a raycaster, not raytracer, which eliminates the recursive work allocation for rays. And use procedural geometry, not polygon lists, which eliminates recursive data structure searches. These two simplifications work very well together and also fit the GPU architecture well since divergence is minimized. It is much easier to understand and program, yet can still produce very impressive realtime results. It can be implemented in CUDA, but most people in this niche just use OpenGL shading language. CUDA implementation is very similar.

I’ll give you two url=https://www.shadertoy.com/view/XsX3RB[/url] url=https://www.shadertoy.com/view/ld3Gz2[/url], each of which uses only 300 lines of quite readable and hackable OpenGL shader code.

Thanks for the examples, SPWorley. I’m reading them now.
I cannot change it into raycaster either… since I’ve defined it in my project environment. And I use triangles…
My BVH problem has been solved by changing it into loop :)
By the way, does cudaMemcpy has max size?

You might find this useful.

Hello, SPWorley. I just read the article. But, does CUDA support OOP?

And I also ask for help about this topic
https://devtalk.nvidia.com/default/topic/976384/cuda-unspecified-launch-failure-on-cudamemcpy-and-cudafree/
Thanks a lot…