ray tracer choosing tools

Hi everybody,
I am going to write a ray tracer on the GPU, hoping to get real-time performance ( O:) ) (probably on a G80 - 8800 Ultra). I plan to do it using GLSL, and I have no experience with parallel computing.
Do you know of any general guide on how to approach this? I have a lot of doubts (about how to manage secondary rays, for example).
Thanks for your help!

I have also read something about a KD-tree variant that is faster than the two versions you mentioned; googling "CUDA KD-tree" should find the paper.

I myself am doing another type of raytracing (finding all possible paths between two points) and am thinking of using something like a KD-tree to speed up my blocking tests. Have you found any good reading material about what exactly to put in a KD-tree? Does it contain vertices, or faces? And is the number of faces per leaf constant, or variable?

Do you mean the stackless version using ropes on the leaves? I read it, and I have also read “Realtime Ray Tracing on GPU with BVH-based Packet Traversal”: they should have more or less the same performance (the authors are the same), but they seem really hard to implement without any experience with CUDA. I was looking for a GLSL implementation trying to exploit SM4.0, or some simple ray tracer CUDA source code.

Yeah, the stackless version indeed.

Well, I have now ported some Matlab raytracing code to CUDA. It is not too difficult, but I cannot call myself inexperienced anymore.

If you have no experience with either GLSL or CUDA, and you do not need to support non-CUDA cards, I would definitely go with CUDA. I had no experience with either, but did not feel comfortable trying GLSL, so I started with CUDA (simple stuff, extending examples from the SDK). Now I find the mistakes I make are more C mistakes than CUDA mistakes, mainly because I have been coding Matlab for too long, with too little C.

I would add the CPU + XXXX options too… then implement them all and compare the results! (That way your investigation will be more complete!)

Btw… do you plan to trace just spheres, or real, complex triangle sets? What about dynamic geometry too?

Of course, as much as I can get! :) Anyway, my aim is to target real, complex triangle sets and maybe, in the future, dynamic geometry (there are a lot of meanings of “dynamic” anyway…)

Doesn’t using CUDA mean using parallel programming and thread synchronization?

If you are at least a little comfortable programming in C/C++ and have never done any graphics programming, take CUDA, as it is much easier to understand than GLSL.

I myself am also making a raytracer, though not the usual kind: a medical application collecting all voxels along the path of a ray. So I understand the principle of the normal raytracer but am not familiar with those techniques you are using. I would love to see some timings when you’ve finished those implementations.

Hello fellow raytracers! Just kidding…

Anyway, I’m another guy currently into raytracing, but for my Master’s degree (!). My experience so far, as related in the recent “GLSL vs CUDA” thread, is that GLSL appears to be faster with my current implementation.

I have ported a fragment shader that performs the entire raytracing of primary rays only from GLSL to CUDA. It uses a uniform grid as acceleration structure, since it is reasonably simple to build and traverse. Depending on the scene I can get about 8M rays/sec. The CUDA version at best is able to match that performance.
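For anyone wondering what “traversing a uniform grid” looks like in code, here is a minimal CPU-side sketch of the classic 3D-DDA stepping (Amanatides & Woo style). All names are illustrative, not taken from the poster’s shader; it assumes the ray origin already lies inside the grid and that the grid’s min corner sits at the world origin:

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Grid { int nx, ny, nz; float cell; };  // cells per axis, cell edge length

// Collects the (x,y,z) indices of every cell the ray visits, in order,
// until the ray leaves the grid.
std::vector<std::array<int,3>> traverse(const Grid& g,
                                        float ox, float oy, float oz,
                                        float dx, float dy, float dz)
{
    // Cell containing the origin.
    int ix = (int)std::floor(ox / g.cell);
    int iy = (int)std::floor(oy / g.cell);
    int iz = (int)std::floor(oz / g.cell);

    // Per axis: step direction, ray-t of the next cell boundary, and the
    // t increment needed to cross one full cell.
    auto setup = [&](float o, float d, int i) {
        float step   = (d >= 0.f) ? 1.f : -1.f;
        float next   = (i + (d >= 0.f ? 1 : 0)) * g.cell;   // next boundary
        float tMax   = (d != 0.f) ? (next - o) / d : 1e30f; // t to reach it
        float tDelta = (d != 0.f) ? g.cell / std::fabs(d) : 1e30f;
        return std::array<float,3>{step, tMax, tDelta};
    };
    auto sx = setup(ox, dx, ix), sy = setup(oy, dy, iy), sz = setup(oz, dz, iz);

    std::vector<std::array<int,3>> cells;
    while (ix >= 0 && ix < g.nx && iy >= 0 && iy < g.ny && iz >= 0 && iz < g.nz) {
        cells.push_back({ix, iy, iz});  // here you'd test the cell's triangles
        // Step along whichever axis hits its next boundary first.
        if (sx[1] <= sy[1] && sx[1] <= sz[1]) { ix += (int)sx[0]; sx[1] += sx[2]; }
        else if (sy[1] <= sz[1])              { iy += (int)sy[0]; sy[1] += sy[2]; }
        else                                  { iz += (int)sz[0]; sz[1] += sz[2]; }
    }
    return cells;
}
```

The same loop maps almost line-for-line onto a fragment shader or CUDA thread, with the triangle tests inlined where the `push_back` is.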

I haven’t found any use for SM4.0, only minor things like integer addressing to textures. No geometry shaders, transform feedbacks and stuff like that :).

Also, I have some experience with kd-trees in the CPU world. And I must say they’re a pain in the ***. Especially the SAH building: too many corner cases and precision problems… But they’re at least 2x faster for raytracing than the uniform grid, for example. I’ve never tried a BVH, but that latest paper using one on the GPU was rather interesting. Take a look at kd-tree vs. BVH construction and traversal algorithms and choose the one that looks simpler :).

IMO, the easiest option is to implement acceleration-structure building on the CPU and use the GPU only to perform the actual raytracing. I don’t know which option I would recommend, GLSL or CUDA: for me, a naive mapping to either API is similar in difficulty. That is, each fragment/thread traces a single ray. You can start from there and maybe experiment with packets/frustums later on.

Sorry I haven’t voted, I just have no experience with any option :(.

DenisR, you should check Vlastimil Havran’s and Ingo Wald’s theses. They’re the best references I’ve found about using kd-trees. Havran mainly discusses the SAH principle, while Wald is more about implementing fast construction and ray traversal (but watch out for some “hidden” complications in his work). There is also a publication called “On building SAH kd-trees and doing that in O(n log n)”, or something like that. It’s really good for getting a grip on SAH construction, since SAH seems to be used for BVHs as well.

Oh, in the CPU world kd-tree internal nodes hold:
. a split plane (axis + position)
. pointer to left child (right = left + 1)
. flag to indicate leaf node

While leaves hold:
. pointer to triangle references
. triangle total

Some algorithms, like MLRTA (good read!), require additional data such as leaf bounding boxes (AABB).
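For illustration, those fields can be packed quite tightly. Here is a hypothetical 8-byte node in the spirit of the compact CPU kd-tree layouts Wald describes; the exact bit layout is an assumption for this sketch, not taken from any particular implementation:

```cpp
#include <cstdint>

// Hypothetical 8-byte kd-tree node. The two low bits of `flags` encode the
// split axis (0/1/2) or 3 for "leaf"; children are allocated in pairs, so
// "right = left + 1" means only one child index is stored.
struct KdNode {
    union {
        float    split;     // inner node: split plane position
        uint32_t triStart;  // leaf: first index into the triangle-reference list
    };
    uint32_t flags;         // bits 0-1: axis, or 3 for leaf;
                            // bits 2+: left-child index (inner) or tri count (leaf)

    bool     isLeaf()    const { return (flags & 3u) == 3u; }
    int      axis()      const { return flags & 3u; }
    uint32_t leftChild() const { return flags >> 2; }
    uint32_t triCount()  const { return flags >> 2; }

    static KdNode inner(int axis, float split, uint32_t left) {
        KdNode n; n.split = split; n.flags = (left << 2) | (uint32_t)axis; return n;
    }
    static KdNode leaf(uint32_t triStart, uint32_t count) {
        KdNode n; n.triStart = triStart; n.flags = (count << 2) | 3u; return n;
    }
};
```

Keeping nodes this small matters more on the GPU than on the CPU, since a whole subtree can then fit in a cache line or a texture fetch.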

Wald’s I have, only no time to read it yet. I also found a paper on acceleration structures for raytracing on the GPU. I don’t have it handy, but that one also contains example code, so I hope it will be a good starting point.

Thanks for the tips. When I am a little smarter, I will report back.

Hello Pins! As I can see, I am involved in CUDA raytracing in a very similar way, because of my Master’s thesis too :D I have done only a little research on this, because I started the project just recently, so I have to read as many helpful materials as possible. My goal is to implement a “perfect” raytracer which should outperform my colleague’s SM3.0 RT in both speed and quality :P (if that is possible O:) ). But still, I’m just a rookie in this :blink:

Heh, Vlastimil Havran used to be my master’s thesis supervisor, and it’s quite possible that he could be the opponent for my current thesis.

So GL to everybody who is trying to implement any kind of raytracer with CUDA :thumbup:

PS.: I really need fully functional CUDA 2.0 (!) :D

How have you guys implemented a ray tracer, since CUDA doesn’t support recursion? The ray tracing algorithm requires recursion once it’s time to perform reflection, transmission, etc.

You can convert any recursive function to an iterative function.

Alternatively, you could write an “uber” function that handles tracing, shading, reflections, and refractions all in one go… that’s what I have done.

I have written a Ray Tracer on CUDA 2.0 capable of handling dynamic scenes with textures.

Currently my RT goes up to depth = 3.

And yeah, I am using a BVH (built top-down, not SAH-based) to speed things up.

Apparently BVHs are simpler to construct than kd-trees and better suited for dynamic scenes.
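A top-down, non-SAH build like the one described usually means a median split. Here is a minimal CPU sketch under that assumption (the node layout, leaf size, and use of centroid bounds are all illustrative, not the poster’s actual code):

```cpp
#include <algorithm>
#include <vector>

struct Aabb { float mn[3], mx[3]; };
struct Tri  { float centroid[3]; /* vertices omitted for brevity */ };
struct Node { Aabb box; int left = -1, right = -1, first = 0, count = 0; };

// Recursively partitions tris[first, first+count) at the median of the
// longest axis; returns the index of the node created for this range.
int build(std::vector<Node>& nodes, std::vector<Tri>& tris,
          int first, int count, int leafSize = 4)
{
    // Bounds of the centroids (a real build would use full triangle bounds).
    Aabb box{{1e30f, 1e30f, 1e30f}, {-1e30f, -1e30f, -1e30f}};
    for (int i = first; i < first + count; ++i)
        for (int a = 0; a < 3; ++a) {
            box.mn[a] = std::min(box.mn[a], tris[i].centroid[a]);
            box.mx[a] = std::max(box.mx[a], tris[i].centroid[a]);
        }

    int id = (int)nodes.size();
    nodes.push_back({box, -1, -1, first, count});
    if (count <= leafSize) return id;          // small enough: make a leaf

    // Median split along the longest axis of the bounds.
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (box.mx[a] - box.mn[a] > box.mx[axis] - box.mn[axis]) axis = a;
    int mid = first + count / 2;
    std::nth_element(tris.begin() + first, tris.begin() + mid,
                     tris.begin() + first + count,
                     [axis](const Tri& x, const Tri& y) {
                         return x.centroid[axis] < y.centroid[axis];
                     });

    int l = build(nodes, tris, first, mid - first, leafSize);
    int r = build(nodes, tris, mid, first + count - mid, leafSize);
    nodes[id].left = l; nodes[id].right = r; nodes[id].count = 0;
    return id;
}
```

The recursion here is fine because the build runs on the CPU; only the traversal has to be made iterative for the GPU.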

Currently I don’t use shared memory, so I don’t get enough of a speedup, but yeah, real-time ray tracing is what I have demonstrated to my supervisor :D .

We’re still looking for ways to optimize ray tracing so if anyone has any bright ideas don’t forget to post 'em here. :)

I have spent many months trying to figure this out, so if anyone needs help on this I would gladly help. :)



OK, I have some questions. I am doing another kind of raytracer, where I am looking at all possible paths between points. So for me, the acceleration structure would be used to see if the rays I find are blocked by triangles, and so in reality do not occur.

I have been looking at acceleration structures for this blocking stage. I have not tried to implement any yet, as I am still busy coding the parts where I find possible rays. As far as I know now, there are some ups, downs, and for me some unknowns ;)

KD-tree : not trivial on the GPU, because no recursion is allowed. Options: kd-restart, kd-backtrack, and the trick of storing an extra pointer (ropes). Splits one dimension per node.
BVH : one triangle can be part of more than one BVH node, so I think some extra memory is required. Also, I do not yet understand how to traverse the hierarchy, because as far as I can see you might need some recursion.
BIH : a bit of a mix of the two above. Splits one dimension per node; triangles can only exist in one part of the split. I also do not yet see how to traverse this without recursion.
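On the traversal question: the usual way around the missing recursion is a small explicit stack of node indices, which is how most GPU BVH traversals are written. A CPU-side sketch under an illustrative node layout (nothing here is from a specific paper):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct BvhNode {
    float bmin[3], bmax[3];   // node bounding box
    int left, right;          // child node indices, or -1 for a leaf
    int triStart, triCount;   // triangle range (used at leaves)
};

// Ray-AABB slab test for a ray o + t*d with t in [0, tMax]; invD = 1/d.
bool hitBox(const BvhNode& n, const float o[3], const float invD[3], float tMax) {
    float t0 = 0.f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        float tn = (n.bmin[a] - o[a]) * invD[a];
        float tf = (n.bmax[a] - o[a]) * invD[a];
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn); t1 = std::min(t1, tf);
    }
    return t0 <= t1;
}

// Iterative traversal with a fixed-size stack, no recursion. Returns how
// many leaves the ray reached (a stand-in for the triangle tests).
int traverse(const std::vector<BvhNode>& nodes, const float o[3],
             const float invD[3], float tMax) {
    int stack[64], sp = 0, leaves = 0;
    stack[sp++] = 0;                        // push root
    while (sp > 0) {
        const BvhNode& n = nodes[stack[--sp]];
        if (!hitBox(n, o, invD, tMax)) continue;
        if (n.left < 0) ++leaves;           // leaf: intersect its triangles here
        else { stack[sp++] = n.left; stack[sp++] = n.right; }
    }
    return leaves;
}
```

The same fixed-size `int stack[64]` works in a CUDA kernel (it lives in local memory or registers), which is why BVH traversal does not actually need recursion.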

One part that is different for me: I have a ray from (x1,y1,z1) to (x2,y2,z2) and have to see whether it is blocked anywhere along the path. For raycasting (which is what normal raytracing is really doing, as far as I can see) you have a starting point and a direction, and you want to find the first intersection with a triangle.
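For the blocked-or-not case, one common approach is to treat the segment as a shadow ray: intersect with the unnormalized direction P2 - P1 and accept only hits whose parameter t falls strictly between 0 and 1, i.e. between the endpoints; any single hit means "blocked", so traversal can stop at the first one. A sketch using Möller-Trumbore (all names invented for this example):

```cpp
#include <cmath>

struct Vec3 {
    float x, y, z;
    Vec3 operator-(const Vec3& b) const { return {x - b.x, y - b.y, z - b.z}; }
};
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
static float dot(const Vec3& a, const Vec3& b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

// True if the segment p1->p2 is blocked by triangle (v0, v1, v2).
bool segmentBlocked(Vec3 p1, Vec3 p2, Vec3 v0, Vec3 v1, Vec3 v2) {
    const float eps = 1e-7f;
    Vec3 d  = p2 - p1;                 // unnormalized: t is in segment units
    Vec3 e1 = v1 - v0, e2 = v2 - v0;
    Vec3 p  = cross(d, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return false;   // segment parallel to triangle
    float inv = 1.f / det;
    Vec3 s = p1 - v0;
    float u = dot(s, p) * inv;                // barycentric u
    if (u < 0.f || u > 1.f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(d, q) * inv;                // barycentric v
    if (v < 0.f || u + v > 1.f) return false;
    float t = dot(e2, q) * inv;
    return t > eps && t < 1.f - eps;          // hit strictly between endpoints
}
```

The nice consequence for the acceleration structure is that an occlusion query can return on the first hit instead of tracking the nearest one, which simplifies traversal.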

Hopefully in the coming week I will have some time to start implementing an acceleration structure. Maybe I will first do it in Matlab, to understand them more quickly ;)

The KD-tree is THE fastest ADS for static scenes… if you’re doing dynamic scenes you’re better off with a BVH, because it’s easier to reconstruct; several papers exist on doing selective restructuring.

You don’t need a BV for each triangle. That won’t give any speedup.

Because you would then be doing 1.5 times the number of intersection tests (on average). A BV should exist for a reasonably sized piece of the model, and when you reach a leaf of the BVH, just do the triangle intersection tests.

I have a few questions of my own… I am using shared memory just for bringing some geometry into SMEM, so that my warps can ray trace through geometry kept in SMEM to get speedups.

But I haven’t gotten any speedup compared to not using SMEM?

I still don’t “coalesce” my global memory reads.



Do all your threads need all that geometry? If so, you should see some speedups. But if you currently get your geometry through a texture fetch, you might not see much speedup, because of the texture cache.

I have found some example code for a BVH. I now understand how to put the structure and the triangles in memory :)

Currently I bring all the geometry into shared memory (the geometry is not too big), in the hope that the threads shouldn’t have to access local memory… Ideally I should bring in only the geometry that my warps need (when I get to massive models).

Also, some recent discussions with my supervisor suggested that a GLSL raytracer could outperform a CUDA raytracer, as the pipeline is optimized for GLSL while CUDA is being imposed on the hardware. :ermm:

Regardless, I remain optimistic. :yes:





Maybe you can post the address where you found the example code for the BVH?

I am trying to implement the paper by Günther et al., “Realtime Ray Tracing on GPU with BVH-based Packet Traversal”. Naive raytracing works until I have “only” 1000 triangles, so I need an acceleration structure… but I have no idea how to get the BVH onto the GPU… Maybe someone has some advice for me :thumbup:

Greetings thopil

I found the example code in the paper comparing acceleration structures for GPU raytracing. It has example code for a BVH, an octree, kd-restart and kd-backtrack.


Yep seen it before…

Theirs is basically a GLSL-type implementation of a Ray Tracer (with vertex and fragment programs)

I think there’s a thesis by Jeffrey Mahovsky from the University of Calgary on BVHs (dated 2005, on CPU of course) and one by Carsten Benthin from Saarland University, dated 2006, on packetizing BVH traversal… which, with a few modifications, can be helpful for coalesced/coherent traversal on the GPU.

I haven’t coded up the BVHs myself yet, but am in the process of doing it now. I am caught between writing a software-managed cache and a fixed-depth iterative BVH traversal.

I would like to know the kind of Frame rates you’re getting for your RT.

Also, do you find yourself limited by registers or by memory latency?

Compile with --ptxas-options=-v to find out how many registers, smem bytes and lmem bytes your kernel is using.



Hi Akshay,

am I asking too much if I request your code, or even just a binary, for testing purposes? I am writing mine in GLSL using more or less the same strategy (BVH…).

I can give you my email for that.

Thank you