Thinking Parallel, Part II: Tree Traversal on the GPU

Originally published at:

In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while…

Hi! Have anybody managed to get close to 0.25ms for BVH tree traversal timings stated above? My implementation of traverseIterative on GPU is extremely slow (>100ms) which is very surprising as implementation of algorithms from part iii gives very close timings to those declared in paper. I'm testing a 18K triangles scene, which has about 50K potential intersections. If anybody can help with advice, I can provide my source code.

Why the if-else in "traverseIterative" does not result in divergent? I can't understand this.

Are you referring to the piece I quote below?

if (!traverseL && !traverseR)
node = *--stackPtr; // pop
node = (traverseL) ? childL : childR;
if (traverseL && traverseR)
*stackPtr++ = childR; // push

I think that this causes execution to diverge, but then probably it can converge again at the end of the else block. That is only my speculation though... I too would like to see somebody explain more fully when execution divergence does or does not happen. I imagine it will vary from one piece of hardware to another, and/or from one compiler to another...

I'm down to 2.5 ms or so. Not what's claimed above (I need to optimize something it seems) but better than 100 ms.