CUDA recursion failed

rinsavs · November 11, 2016, 4:30pm

Hello, everyone. Please help.

I am creating a ray tracing application that requires recursion. The first level of recursion works, but the second does not. Below is the structure of my functions:

__device__ void BVHIntersection(params){
    if(condition){
        //some code
    }else{
        BVHIntersection(params_1);
        BVHIntersection(params_2);
    }
}

__device__ float intersection(params, rec){
    //some code
    BVHIntersection(params);
    //some other code
    if(reflective and rec < 1) //rec is the stopping condition
        float *la = intersection(params_1, rec + 1);
}

The first call to intersection:
rec = 0
recursion in BVHIntersection runs fine

The second call (in the reflective part):
intersection is called again with rec = 1 → it is called
recursion in BVHIntersection cannot be called. //this is the problem

Any help will be highly appreciated.
Thanks in advance

P.S. I cannot start CUDA debugging. I call free(var) for every malloc.

MutantJohn · November 11, 2016, 5:15pm

GPU threads have relatively small stacks so my first advice would be to avoid recursion in device functions unless you can guarantee that the recursive depth is capped at something small enough to comfortably fit in the stack.

What do you mean by “recursion on BVHIntersection cannot be called”? I’m assuming this also stands for bounding volume hierarchy?

rinsavs · November 11, 2016, 11:19pm

Hello, MutantJohn. Thanks for your reply!
So, is a loop better than recursion?

Yes, it is a Bounding Volume Hierarchy. I tried to printf something before the recursive call and it was executed. But when I tried to print something after the recursive call, it was not executed. So I thought the recursive function could not be called.

SPWorley · November 12, 2016, 1:52am

In general, traversing a bounding volume hierarchy is more efficient with an explicit stack of nodes to process than with conceptually elegant recursion. Both do the same work in the same order, but the custom work stack is more efficient in memory (often it stores only a set of pointers, not a whole stack frame of data), in computation (you only update a pointer instead of reloading a frame of data), and in versatility (once you find your goal intersection and want to exit, with a stack you just release the stack and return, whereas with recursion the state has to be popped at each level). This is generally true on both CPU and GPU, and for both point-like lookups (say, particle system searches) and raytracing (which needs ordered intersections along a ray traversal). CPUs are far more efficient at recursion than GPUs, since they just shift frame data with a stack pointer update and let the L1 cache lazily hide most of the expense. The GPU needs to actively copy its data, which can benefit from caching but is still less efficient.
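A minimal sketch of the explicit-stack traversal described above, not SPWorley's actual code. The `Node` and `Ray` layouts, `MAX_STACK`, `hitBox`, and `traverse` are hypothetical names chosen for illustration, and the tree is assumed to be flattened into an array with the root at index 0. The small macro shim lets the same source build with nvcc or as plain C++ for host-side checking.

```cpp
#include <cassert>

// Outside nvcc, define the CUDA qualifiers away so this builds as plain C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical layouts (assumptions, not from the thread).
struct Ray  { float o[3], invd[3]; };          // origin and 1/direction
struct Node { float lo[3], hi[3];              // axis-aligned bounding box
              int left, right, prim; };        // left < 0 marks a leaf

constexpr int MAX_STACK = 64;                  // must exceed the tree depth

// Slab test: does the ray overlap the node's box within [0, tMax]?
__host__ __device__ inline bool hitBox(const Ray& r, const Node& n, float tMax) {
    float t0 = 0.0f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        float ta = (n.lo[a] - r.o[a]) * r.invd[a];
        float tb = (n.hi[a] - r.o[a]) * r.invd[a];
        if (ta > tb) { float s = ta; ta = tb; tb = s; }
        if (ta > t0) t0 = ta;
        if (tb < t1) t1 = tb;
        if (t0 > t1) return false;
    }
    return true;
}

// Visits the tree in the same order as the recursive version, but keeps
// only node indices on a small local stack. Returns the number of leaf
// primitive ids written to hits.
__host__ __device__ inline int traverse(const Node* nodes, const Ray& ray,
                                        float tMax, int* hits, int maxHits) {
    int stack[MAX_STACK];
    int top = 0, count = 0;
    stack[top++] = 0;                          // start at the root
    while (top > 0) {
        const Node& n = nodes[stack[--top]];
        if (!hitBox(ray, n, tMax)) continue;   // prune this subtree
        if (n.left < 0) {                      // leaf: record its primitive
            if (count < maxHits) hits[count++] = n.prim;
            continue;
        }
        assert(top + 2 <= MAX_STACK);          // guard against overflow
        stack[top++] = n.right;                // pushed first, popped last
        stack[top++] = n.left;
    }
    return count;
}
```

Only node indices live on the stack, so the per-thread cost is `MAX_STACK` ints no matter how much state a recursive frame would have carried, and an early exit is just a `return`.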

But this doesn’t get into the enormously important GPU complexity that arises when you consider divergence and try to share node access across (hopefully) similar thread queries. The majority of the fastest GPU raytracing tricks involve actively shuffling work between threads so that each warp’s threads ideally all access the same node/object/particles/rays at the same time. Clean, short, elegant, straightforward traversals like the CPU pseudocode you see in books tend to quickly turn into fully divergent GPU threads, meaning your warps are computing at only 1/32 throughput.

In my own GPU raytracer, I have one traversal stack per warp and multiple rays per thread, which lets the threads opportunistically find the “best ray” to apply to a retrieved node. It gets complex, which is why OptiX is a very useful library; NVIDIA has done a spectacular job of making an efficient raytracing API. You’re getting pretty deep into very advanced architecture-aware software design if you’re making your own GPU tracer from scratch, unlike a CPU raytracer, which in its simplest form is on the difficulty scale of an elegant homework assignment.

None of this directly answers your question about recursion on GPUs, but MutantJohn nailed it with his summary: it is supported, but with a much more limited local stack, and it is to be avoided when possible for both depth and efficiency reasons. I’ll add that many recursive calls are actually tail recursion, which is solved more efficiently by a loop than by a recursive evaluation; that is better on both CPU and GPU. This is common even in hierarchical node searches, especially point-based queries for particle systems. Even a forking recursion (like your query) can replace two recursive calls per level with one recursive call and a (free) tail recursion, which is still recursive but uses fewer resources.
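To make the last point concrete, here is a small sketch of a forking recursion next to a version that keeps one recursive call and turns the other into a loop iteration. The `TreeNode` layout and both function names are hypothetical, and the `depth`/`maxDepth` parameters exist only to make the difference in recursion depth observable. As with any CUDA-style code here, the macro shim is an assumption that lets it build with or without nvcc.

```cpp
// Outside nvcc, define the CUDA qualifiers away so this builds as plain C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical node layout: left < 0 marks a leaf carrying `value`.
struct TreeNode { int left, right, value; };

// Forking recursion: two recursive calls per inner node.
__host__ __device__ int sumForking(const TreeNode* t, int i,
                                   int depth, int* maxDepth) {
    if (depth > *maxDepth) *maxDepth = depth;
    if (t[i].left < 0) return t[i].value;
    return sumForking(t, t[i].left,  depth + 1, maxDepth)
         + sumForking(t, t[i].right, depth + 1, maxDepth);
}

// Same visit order, but the call on the right child is replaced by
// updating `i` and looping, so only left children cost a stack frame.
__host__ __device__ int sumHalfRecursive(const TreeNode* t, int i,
                                         int depth, int* maxDepth) {
    if (depth > *maxDepth) *maxDepth = depth;
    int total = 0;
    while (t[i].left >= 0) {
        total += sumHalfRecursive(t, t[i].left, depth + 1, maxDepth);
        i = t[i].right;              // continue with the right child in place
    }
    return total + t[i].value;       // finally reached a leaf
}
```

On a tree that leans to the right, the second version never goes more than one level deep, while the first goes as deep as the tree. The worst case (a tree that leans left) is unchanged, so a depth cap is still needed.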

rinsavs · November 13, 2016, 2:42am

Hello, SPWorley. Thanks for your reply.
Can you please explain it in more “human” language? Because I’m really a newbie in CUDA and Ray Tracing. ._.

Anyway, I changed my BVHIntersection into a loop and it worked :)

njuffa · November 13, 2016, 3:52am

I don’t want to curb your enthusiasm, but it seems to me that you might want to hone your CUDA skills on slightly less ambitious projects, before embarking on a very complex one, like a parallelized raytracer.

When I was a young lad, there were those who embarked on writing an operating system a week after making first contact with C, and that rarely worked out :-)

rinsavs · November 13, 2016, 4:05am

Hello, njuffa :)
The problem is, the ray tracer is my final bachelor project and I have defined it in my project environment. No way back ._.

SPWorley · November 13, 2016, 4:54am

My post was already a high-level, simplified summary of the common traversal algorithms and data structures. As njuffa says, learning CUDA and raytracing simultaneously will be a challenge, especially since raytracing is a real algorithmic challenge to do efficiently on SIMT architectures like CUDA.

If you are locked into the project of raytracing on GPUs, you might considerably simplify your work by implementing a raycaster rather than a raytracer, which eliminates the recursive work allocation for rays, and by using procedural geometry rather than polygon lists, which eliminates recursive data structure searches. These two simplifications work very well together and also fit the GPU architecture well, since divergence is minimized. This approach is much easier to understand and program, yet can still produce very impressive realtime results. It can be implemented in CUDA, but most people in this niche just use the OpenGL Shading Language (GLSL); a CUDA implementation is very similar.
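As a rough illustration of what a raycaster over procedural geometry can look like, the sketch below marches a ray against a signed distance function (often called sphere tracing). The scene, a unit sphere centered at (0, 0, 3), and the names `sceneDistance` and `castRay` are assumptions made up for this example; the step limit and thresholds are arbitrary. It contains no recursion and no tree, only a fixed-bound loop.

```cpp
#include <math.h>

// Outside nvcc, define the CUDA qualifiers away so this builds as plain C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Procedural geometry: signed distance from (x, y, z) to a unit sphere
// centered at (0, 0, 3). Hypothetical scene for illustration only.
__host__ __device__ inline float sceneDistance(float x, float y, float z) {
    float dz = z - 3.0f;
    return sqrtf(x * x + y * y + dz * dz) - 1.0f;
}

// Casts a ray from the origin along the unit direction (dx, dy, dz).
// Returns the distance to the surface, or -1 if nothing is hit.
__host__ __device__ inline float castRay(float dx, float dy, float dz) {
    float t = 0.0f;
    for (int step = 0; step < 64; ++step) {
        float d = sceneDistance(t * dx, t * dy, t * dz);
        if (d < 1e-4f) return t;     // close enough to the surface: hit
        t += d;                      // safe step: nothing is nearer than d
        if (t > 100.0f) break;       // left the scene
    }
    return -1.0f;
}
```

In a kernel, each thread would call `castRay` for its own pixel direction and shade from the returned distance. Neighbouring threads run nearly the same number of loop iterations, which is why this style tends to diverge less than tree traversal.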

I’ll give you two examples, https://www.shadertoy.com/view/XsX3RB and https://www.shadertoy.com/view/ld3Gz2, each of which uses only 300 lines of quite readable and hackable OpenGL shader code.

rinsavs · November 13, 2016, 6:41am

Thanks for the examples, SPWorley. I’m reading them now.
I cannot change it into a raycaster either, since I’ve defined it in my project environment. And I use triangles…
My BVH problem has been solved by changing it into a loop :)
By the way, does cudaMemcpy have a maximum size?

SPWorley · November 13, 2016, 7:20am

You might find this useful.

rinsavs · November 13, 2016, 2:22pm

Hello, SPWorley. I just read the article. But, does CUDA support OOP?

rinsavs · November 13, 2016, 2:24pm

I have also asked for help in this topic:
https://devtalk.nvidia.com/default/topic/976384/cuda-unspecified-launch-failure-on-cudamemcpy-and-cudafree/
Thanks a lot…
