Originally published at: https://developer.nvidia.com/blog/thinking-parallel-part-ii-tree-traversal-gpu/

In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while…

Hi! Have anybody managed to get close to 0.25ms for BVH tree traversal timings stated above? My implementation of traverseIterative on GPU is extremely slow (>100ms) which is very surprising as implementation of algorithms from part iii gives very close timings to those declared in paper. I'm testing a 18K triangles scene, which has about 50K potential intersections. If anybody can help with advice, I can provide my source code.

Why the if-else in "traverseIterative" does not result in divergent? I can't understand this.

Are you referring to the piece I quote below?

if (!traverseL && !traverseR)

node = *--stackPtr; // pop

else

{

node = (traverseL) ? childL : childR;

if (traverseL && traverseR)

*stackPtr++ = childR; // push

}

I think that this causes execution to diverge, but then probably it can converge again at the end of the else block. That is only my speculation though... I too would like to see somebody explain more fully when execution divergence does or does not happen. I imagine it will vary from one piece of hardware to another, and/or from one compiler to another...

I'm down to 2.5 ms or so. Not what's claimed above (I need to optimize something it seems) but better than 100 ms.

In the section where the stack is explicitly managed during traversal, a hard-coded stack size of 64 was used. Where did the number 64 come from? Is there a bound on the number of nodes that will end up in the stack?

I think I may have determined the reason. In a complete binary tree, the maximum stack size required is equal to the depth of the tree. The max depth of the tree is limited to 64 since 2^64 is is the limit of the address space on 64 bit machines. Or at least this is the case if you use an array as the underlying implementation of a complete binary tree.