BVH on CUDA… Does it make any sense, anyway?

Hello,
I’m currently working on a visibility project for a VR framework, in which I compute the visibility of M AABBs against N pyramidal frustums, using CUDA.
Even with a lot of optimization, the results are not what I expected. I was actually asking myself whether this application could benefit from a space-partitioning technique… such as a BVH…
But does it make any sense in CUDA? I have no idea how to use such data structures in CUDA… I’ve already read some papers (“Fast BVH construction on GPUs”, for example).
I’m very curious about how it could be implemented.
Any ideas, anyone?

It’s hard to say without knowing more about your algorithm.

It’s certainly possible to implement BVH traversal in CUDA, and if you have enough parallelism (i.e. enough traversals running in parallel) you should be able to get good speedups over the CPU.

Using stackless traversal (skip indices) for the tree is probably best, but you can also get decent performance using a stack in local memory. You should access the node data through textures, since it is accessed randomly and texture reads are cached.
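
For example, here is a minimal sketch of reading node data through a texture, assuming the nodes are packed as float4 and using the legacy texture-reference API of that era (nodeTex, fetchNode and d_nodes are illustrative names, not from the posts above):

// Texture reference over the BVH node array, declared at file scope.
texture<float4, 1, cudaReadModeElementType> nodeTex;

__device__ float4 fetchNode(int i)
{
    return tex1Dfetch(nodeTex, i);  // random read, served by the texture cache
}

// Host side, after uploading the node array to d_nodes:
// cudaBindTexture(0, nodeTex, d_nodes, numNodes * sizeof(float4));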

I have implemented both the construction algorithm from “Fast BVH construction on GPUs” and a stack-based traversal. On my GTX280, these yield very nice performance.

This method requires using a stack, right? So you have to manage a stack in device memory. I could build my own implementation, but I’m running out of time as the project deadline is coming up fast. Is there any implementation of a device-memory stack available? I guess coding that thing would require some kind of memory-management scheme for my stack… Sounds tricky :S I hope I will be able to finish it.

Yes, you do need a stack. Implementing one is dead simple though: it’s simply an array of stack entries and a single integer that tells you how many entries are currently on the stack (the stack pointer). My stack entry has the following form:

struct __align__(16) StackEntry {
    uint32_t nodeIndex;
    float    a;
    float    b;
};

In each thread, a stack is allocated in local memory using:

StackEntry stack[MAX_DEPTH];
int stackSize = 0;

Finally, I use two macros to push and pop:

#define PUSH3(I, A, B) {                      \
    stack[stackSize].nodeIndex = (I);         \
    stack[stackSize].a         = (A);         \
    stack[stackSize].b         = (B);         \
    ++stackSize;                              \
}

#define POP3(I, A, B) {                       \
    --stackSize;                              \
    (I) = stack[stackSize].nodeIndex;         \
    (A) = stack[stackSize].a;                 \
    (B) = stack[stackSize].b;                 \
}

And that’s all there is to it… Note that the stack resides in local memory, which is as slow as global memory. Putting it in shared memory is infeasible when you have hundreds of threads per block and only 16 KB of shared memory per multiprocessor.
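
To show how the pieces fit together, here is a minimal traversal sketch built around these macros; isLeaf, leftChild, rightChild, intersectNode and processLeaf are hypothetical placeholders for your own node layout and intersection tests:

__device__ void traverse(uint32_t root)
{
    StackEntry stack[MAX_DEPTH];   // per-thread stack in local memory
    int stackSize = 0;

    uint32_t node = root;
    float a = 0.0f, b = 1.0e30f;   // e.g. the entry/exit parameters of a ray

    for (;;) {
        if (intersectNode(node, a, b)) {
            if (isLeaf(node)) {
                processLeaf(node);
            } else {
                PUSH3(rightChild(node), a, b);  // defer one child
                node = leftChild(node);         // descend into the other
                continue;
            }
        }
        if (stackSize == 0)
            break;                 // nothing left to visit
        POP3(node, a, b);          // resume a deferred subtree
    }
}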

Thanks a lot… I’ll give it a try tonight :)

In my application I have N pyramidal frustums culling M AABBs organized in a BVH (built on the CPU, but never mind that). Is it feasible to launch k threads (k > N) which read culling tasks from a common stack and process them?

Like, multiple threads executing culling tasks… adding new tasks for the child nodes… etc… with every thread synchronizing after each modification of the task stack.

Does it seem feasible to do it that way? Or is it better to launch N threads, one for each frustum?

Put another way: is using more threads that synchronize over one big task list worse than using fewer threads that each have their own stack and therefore don’t need to sync?

Not sure if I put it clearly enough, though.

Optimizing for CUDA involves a lot of black magic and wizardry. It is difficult to tell what will be optimal in your case.

For traversal, using one thread per traversal with no synchronization at all works great.
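
As a concrete sketch of the zero-synchronization variant, assuming one thread per frustum and a per-thread traversal routine like the one above (Frustum and cullFrustum are illustrative names):

// One thread per frustum; each thread owns its stack, so no sync is needed.
__global__ void cullKernel(const Frustum* frustums, int numFrustums)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= numFrustums)
        return;
    cullFrustum(frustums[f]);   // independent stack-based BVH traversal
}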

For building BVHs, however, many people use global task lists from which each builder thread can grab one job after another.
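
A minimal sketch of such a global task list, assuming jobs are claimed through an atomic counter (Task, g_nextTask and processTask are illustrative; appending new jobs would need a second atomic counter, and g_nextTask must be reset with cudaMemcpyToSymbol before launch):

__device__ int g_nextTask;   // shared job counter in global memory

__global__ void builderKernel(Task* tasks, int numTasks)
{
    for (;;) {
        int t = atomicAdd(&g_nextTask, 1);  // claim the next job
        if (t >= numTasks)
            return;                         // queue drained
        processTask(&tasks[t]);             // may append child jobs
    }
}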

So, both zero-synchronization parallelization and a global task queue have been used successfully in CUDA… and it remains up to you to try both and find out what works best :). I would go with the simpler zero-synchronization approach for now, though. It works well for me and it is quicker to implement. If your time is tight, this might be the best way to go. But if you do try both, make sure to report back what works best…

In fact, there might be a simple way to avoid stack-based traversal. You can convert your binary tree (I’m using a plain binary BVH) into a threaded binary tree with skip links, and then encode this TBT into an array. The traversal order is directly encoded into the nodes: when the intersection test on the current node is positive, just go to the next node; when it’s negative, jump to the node pointed to by the skip link (which is an index into the node array).

Like described here: http://www.cs.utah.edu/~bes/papers/fastRT/paper-node12.html

This way you don’t need recursion… and thus no stack or task-queue.
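
Here is a minimal sketch of that traversal over a roped node array, assuming the nodes are stored in depth-first order, each node carries a skipIndex, and the last node’s skip index equals numNodes so the loop terminates (TBTNode, intersect, isLeaf and processLeaf are illustrative):

struct TBTNode {
    // ... AABB bounds, leaf payload, etc. ...
    uint32_t skipIndex;   // where to jump when the intersection test fails
};

__device__ void traverseStackless(const TBTNode* nodes, uint32_t numNodes)
{
    uint32_t i = 0;
    while (i < numNodes) {
        if (intersect(nodes[i])) {
            if (isLeaf(nodes[i]))
                processLeaf(nodes[i]);
            ++i;                      // hit: next node in DFS order
        } else {
            i = nodes[i].skipIndex;   // miss: skip the whole subtree
        }
    }
}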

I don’t have any idea about the performance of this method, though…

Roped trees are indeed one of the known ways to do stackless traversal. It just so happens that, for me, every stackless traversal I tried was slower than the stack-based one.