Thinking Parallel, Part III: Tree Construction on the GPU

Originally published at: https://developer.nvidia.com/blog/thinking-parallel-part-iii-tree-construction-gpu/

In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree…

Its not really clear from the paper - http://devblogs.nvidia.com/... as to how the Octree nodes are allocated and how parent-child relation between the octree nodes are established ?

Hi ertf23, I'm not sure if you have figured it out already... I have
thought for a (relatively) long time about it and think that I came to a
working solution (I am trying to code it in a few minutes.) I'm not
sure but if you or anyone else is interested I am willen to share my
code. Greetings, Neko (and sorry for my bad english, I still am learning that language)

I did figured out both - how to allocate octree nodes and connecting the octree nodes in parent-child relationship. It also took me a long time to find the solution. I think my solution is sub-optimal especially the algorithm for connecting nodes in parent-child relationship. It will be great if you could share the C or pseudo code on github or ideone. I will share mine too. We can then figure out the fastest algorithm

is pastebin okay too? (for now, we can work on a project at github if I get all my code snippets together ^^" )
And do you prefer CPU code with "virtual" parallelism?
I mean like:
#pragma omp parallel for schedule(dynamic, 32)
for(int idx = 0; idx < max; ++idx)
dosth(idx);
or would it be better if I write the CUDA code already?

I have just fixed out a bug where multiple equal morton codes generate a broken tree.

On my Intel i-7 4800M the sequentiell algorithm takes 4.07 seconds and the parallel (with 8 threads) takes 1.625 seconds to build the tree, both on 50.000.000 Objects (binary search NOT implemented in the parallel one, still looking index by index there ^^" )

I will clean my code and as soon as I get the binary search in there you can see it (but for today I still have a more important work to do, so it may take 1 or 2 more days but I guess it doesn't add much to 10 months)

About the image: I build an example main function that tests the CLZ(0) (because I was unsure if it will give me 32 or 31 ^^" haven't removed it yet, my fault)

it creates a list of 50 Million points from [0,0,0] to [1,1,1] and generates the morton code for each object

then sort it by the morton code

Then I count through the objects and count morton code that are twice, trice, 4 times or 5 times in the list (for output reason, the old version generated false trees as soon as there were more than one group of 4 equal codes...)

generate a tree with the old sequential methode and another tree with the parallel methode

then I walk through every node of both trees, count the times and the objects in the tree (for validation)

destruct the trees and tadaah

//TODO: Clean the code...

Pastebin or any other code sharing website will also be fine. Even if your code is broken, I am alright with it. Just need to know the high level algorithm for connecting octree nodes in parent child relationship. You can share c,c++ or cuda c or even pseudo code. Below is the link to the pseudo code I wrote. If you have any trouble viewing it, please let me know. The first page is terminology, second page is Octree node construction algorithm, third & fourth page present two algorithms for linking the octree nodes in parent-child relationship

https://docs.google.com/doc...

Okay, I'll read it in a sec (after commiting to github... still kinda figuring out the linux way... new to linux ^^")

https://github.com/Ooken/Ne...

Here you can find it in the "Tree" folder

I always wanted to write a path-tracing engine (already did kinda like that... but it was half a year ago and very slow, without bvh's and other nice stuff... only simple diffuse surface, spheres, endless planes and so on...)

I hope you like how I write code and I do invite you to work on this engine together (I aim it to be realtime on one side but nice-looking on the other side. Anyway nice and readable code/easy to use code is the main goal) since I think you are a nice person and know how to code(?).

Also sorry for packing it into one file (I tried to hold it simple for this example)

*just red your psedocode* I'm a little bit confused...
Give me a little more time to understand it...

Oh, and before I forget it:
I found a solution for "multiple equal morton codes", also I already tested with 30mio, 50mio and (ram kill) 100mio objects, also with some equal codes inserted by hand at the front, randomly in the middle and at the end and it generated fully valid trees (was kinda fun to write a self-diagnostic code and let it run over night and while in school, ran 18 hours without a memory leak and without an error, I guess that should do it) ^^

Hallo Neko

Vielen Danke schön!! :))

Thanks for sharing the code. It looks good. Actually what I am looking for is not how to make Binary radix tree but how to create Octree using the the binary radix tree. If you go to page 4 of the original paper by Tero Karras, on the right column you can see a section titled "Octrees".

I am copy pasting the below the steps needed to create octree as per this paper:

(1) calculate Morton codes for the points,
(2) sort the Morton codes,
(3) identify duplicates, i.e. points falling within the same leaf
node, by comparing each pair of adjacent Morton codes,
(4) remove the duplicates using parallel compaction,
(5) construct a binary radix tree
(6) perform parallel prefix sum to allocate the octree nodes, and
(7) find the parent of each octree node.

Your code implements only steps 1-5 but not 6 and 7. I also implemented steps 1-5 similar to the way you implemented but I also implemented steps 6,7 (sub-optimal). The pseudo-code I shared with you performs steps 6 and 7(sub-optimal).

The author gives some description about steps 6 and 7 but the algorithm for step 7 is not clear, especially the last cryptic line : "The parents of the octree nodes can then be found by looking at the immediate ancestors of each radix tree node."

So what I am looking for is optimal solution to step 7.

hi!

hello Neko
how to deal with the case of duplicate Morton codes?

hi!
hello Neko!
how to deal with the case of duplicate Morton codes?

Just scan the whole array of objects and remove objects with duplicate morton codes. Another way is to somehow prevent duplicate morton codes.

" remove any objects" I think this way is not feasible.

I read the paper http://devblogs.nvidia.com/...

The Chaper 4 give a solution , but I did not understand it. Can you help me.

I did it like this:
mcodeleft = lst[index-1];
mcoderigt = lst[index+1];
when (mcodeleft == mcoderight)
then set index as start point and go through the elements, till there is a diffrent element, also change the "find split" function, so it doesn't return the middle, when the start and end elemet are equal but return the first element, so that when there is a group of equal codes, the subtrees will start at the second element and will go on like "left node = object, right node = subnode index+1" and so on, till all objects are included

so for example if you have multiple triangles with the same midpoint, they won't be deletet

@nurabha:disqus Could we get in touch in a more "direct" chat? I want to help you with the octree but somehow I don't get the "parallel concept" (I do get the concept of ocrtees but I don't get the parallel concept...)

and I ask again, are you willen to participate in coding a nice to use (I hope realtime) Path/RayTracer API in c++&CUDA?

did this code deal with duplicate Morton codes?
https://github.com/Ooken/NekoE...

if not could you wtire an short example

I fixed all bugs and yes, both examples deal with duplicate morton codes ^^

I achieved it by testing index-1==index+1 and if both are equal I will do a forward search till a diffrent morton code comes, also I changed the "find split" function so that it splits in equal-morton-code-cases always at the beginning so that it will go down like stairs and the nodes A is always the object and B is always the node to the other objects/the last object ^^

Do you understand my idea? If not I can try to visualize it a little :3

Also the BoundingBox project already deals with triangles, bounding boxes and duplicate morton codes (if there are any by randomness or made ba hand)

hi,Neko
Ok, I will read it after a minutes.
Can I get an email address from you? If I have some questions, I can send you email.

when the BVH tree is build, how to travel the tree with raytracing?

After building the tree, I will add bounding boxes (as described in the BoundingBox project in my Git :3 )

When tracing I will start at the root node, check it's bounding boxes and when it hit I will test both children, and whenever I hit a bounding box, I will test it's children and so on.

If you got more questions you can ask me at nekoyuke@gmail.com, but if you follow my project along I will try to build a fast, easy to understand and to use ray/path-tracer and a few sideprojects (like the TimeLapseCapture program I just wrote ^^" OpenCV c++ code will follow)

In general I try to update it every day :3

Ok I will try and follow you project.
And I guess if there is an efficient way to travel the bvh tree, becase we build the tree using Mordon tree?

Hi!

I'm also trying to build an octree, but I can't figure it out. How does Karras means, that "δchild/3┘−└δparent/3┘ are divisible by 3"? How do you calculate those counts for the number of octree nodes? How do you decide, which radix tree node becomes an octree node?

Thank you!