BVH on CUDA… Does it make any sense, anyway?

Hello,
I’m currently working on a visibility project for a VR framework, in which I compute the visibility of M AABBs against N pyramidal frustums, using CUDA.
Even with a lot of optimization, the results are not what I expected. I was actually asking myself whether this application could benefit from a space-partitioning technique… such as a BVH…
But does it make any sense in CUDA? I have no idea how to use such data structures in CUDA… I’ve already read some papers (“Fast BVH construction on GPUs”, for example).
I’m very curious about how it could be implemented.
Any ideas, anyone?

It’s hard to say without knowing more about your algorithm.

It’s certainly possible to implement BVH traversal in CUDA, and if you have enough parallelism (i.e. enough traversals running in parallel) you should be able to get good speedups over the CPU.

Using stackless traversal (skip indices) for the tree is probably best, but you can also get decent performance using a stack in local memory. You should access the node data through textures, since it is accessed randomly and texture reads are cached.
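
For example, here is a minimal sketch of reading node data through a texture, assuming the nodes are packed as float4 and using the legacy texture-reference API of that era (nodeTex, fetchNode and d_nodes are illustrative names, not from the posts above):

// Texture reference over the BVH node array, declared at file scope.
texture<float4, 1, cudaReadModeElementType> nodeTex;

__device__ float4 fetchNode(int i)
{
    return tex1Dfetch(nodeTex, i);  // random read, served by the texture cache
}

// Host side, after uploading the node array to d_nodes:
// cudaBindTexture(0, nodeTex, d_nodes, numNodes * sizeof(float4));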

I have implemented both the construction algorithm from “Fast BVH construction on GPUs” and a stack-based traversal. On my GTX280, these yield very nice performance.

This method requires using a stack, right? So you have to manage a stack in device memory. I could build my own implementation, but I’m running out of time as the project deadline is coming up fast. Is there any implementation of a device-memory stack available? I guess coding that thing would require some kind of memory-management scheme for my stack… Sounds tricky :S I hope I will be able to finish it.

Yes, you do need a stack. Implementing one is dead simple though: it’s simply an array of stack entries and a single integer that tells you how many entries are currently on the stack (the stack pointer). My stack entry has the following form:

struct __align__(16) StackEntry {
    uint32_t nodeIndex;
    float    a;
    float    b;
};

In each thread, a stack is allocated in local memory using:

StackEntry stack[MAX_DEPTH];
int stackSize = 0;

Finally, I use two macros to push and pop:

#define PUSH3(I, A, B) {                      \
    stack[stackSize].nodeIndex = (I);         \
    stack[stackSize].a         = (A);         \
    stack[stackSize].b         = (B);         \
    ++stackSize;                              \
}

#define POP3(I, A, B) {                       \
    --stackSize;                              \
    (I) = stack[stackSize].nodeIndex;         \
    (A) = stack[stackSize].a;                 \
    (B) = stack[stackSize].b;                 \
}

And that’s all there is to it… Note that the stack resides in local memory, which is as slow as global memory. Putting it in shared memory is infeasible when you have hundreds of threads per block and only 16 KB of shared memory per multiprocessor.
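
To show how the pieces fit together, here is a minimal traversal sketch built around these macros; isLeaf, leftChild, rightChild, intersectNode and processLeaf are hypothetical placeholders for your own node layout and intersection tests:

__device__ void traverse(uint32_t root)
{
    StackEntry stack[MAX_DEPTH];   // per-thread stack in local memory
    int stackSize = 0;

    uint32_t node = root;
    float a = 0.0f, b = 1.0e30f;   // e.g. the entry/exit parameters of a ray

    for (;;) {
        if (intersectNode(node, a, b)) {
            if (isLeaf(node)) {
                processLeaf(node);
            } else {
                PUSH3(rightChild(node), a, b);  // defer one child
                node = leftChild(node);         // descend into the other
                continue;
            }
        }
        if (stackSize == 0)
            break;                 // nothing left to visit
        POP3(node, a, b);          // resume a deferred subtree
    }
}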

Thanks a lot… I’ll give it a try tonight :)

In my application I have N pyramidal frustums culling M AABBs organized in a BVH (built on the CPU, but never mind that). Is it feasible to launch k threads (k > N) which read culling tasks from a common stack and process them?

Like, multiple threads executing culling tasks… adding new tasks for the child nodes… etc… with every thread synchronizing after each modification of the task stack.

Does it seem feasible to do it that way? Or is it better to launch N threads, one for each frustum?

Put another way: is using more threads that synchronize over one big task list worse than using fewer threads that each have their own stack and therefore don’t need to sync?

Not sure if I put it clearly enough, though.

Optimizing for CUDA involves a lot of black magic and wizardry. It is difficult to tell what will be optimal in your case.

For traversal, using one thread per traversal with no synchronization at all works great.
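
As a concrete sketch of the zero-synchronization variant, assuming one thread per frustum and a per-thread traversal routine like the one above (Frustum and cullFrustum are illustrative names):

// One thread per frustum; each thread owns its stack, so no sync is needed.
__global__ void cullKernel(const Frustum* frustums, int numFrustums)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= numFrustums)
        return;
    cullFrustum(frustums[f]);   // independent stack-based BVH traversal
}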

For building BVHs, however, many people use global task lists from which each builder thread can grab one job after another.
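
A minimal sketch of such a global task list, assuming jobs are claimed through an atomic counter (Task, g_nextTask and processTask are illustrative; appending new jobs would need a second atomic counter, and g_nextTask must be reset with cudaMemcpyToSymbol before launch):

__device__ int g_nextTask;   // shared job counter in global memory

__global__ void builderKernel(Task* tasks, int numTasks)
{
    for (;;) {
        int t = atomicAdd(&g_nextTask, 1);  // claim the next job
        if (t >= numTasks)
            return;                         // queue drained
        processTask(&tasks[t]);             // may append child jobs
    }
}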

So, both zero-synchronization parallelization and a global task queue have been used successfully in CUDA… and it remains up to you to try both and find out what works best :). I would go with the simpler zero-synchronization approach for now, though. It works well for me and it is quicker to implement. If your time is tight, this might be the best way to go. But if you do try both, make sure to report back what works best…

In fact, there might be a simple way to avoid stack-based traversal. You can convert your binary tree (I’m using a plain binary BVH) into a threaded binary tree with skip links, and then encode this TBT into an array. The traversal order is directly encoded into the nodes: when the intersection test on the current node is positive, just go to the next node; when it’s negative, jump to the node pointed to by the skip link (which is an index into the node array).

Like described here: http://www.cs.utah.edu/~bes/papers/fastRT/paper-node12.html

This way you don’t need recursion… and thus no stack or task-queue.
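
Here is a minimal sketch of that traversal over a roped node array, assuming the nodes are stored in depth-first order, each node carries a skipIndex, and the last node’s skip index equals numNodes so the loop terminates (TBTNode, intersect, isLeaf and processLeaf are illustrative):

struct TBTNode {
    // ... AABB bounds, leaf payload, etc. ...
    uint32_t skipIndex;   // where to jump when the intersection test fails
};

__device__ void traverseStackless(const TBTNode* nodes, uint32_t numNodes)
{
    uint32_t i = 0;
    while (i < numNodes) {
        if (intersect(nodes[i])) {
            if (isLeaf(nodes[i]))
                processLeaf(nodes[i]);
            ++i;                      // hit: next node in DFS order
        } else {
            i = nodes[i].skipIndex;   // miss: skip the whole subtree
        }
    }
}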

I don’t have any idea about the performance of this method, though…

Roped trees are indeed one of the known ways to do stackless traversal. It just so happens that, for me, every stackless traversal I tried was slower than the stack-based one.