I haven’t had a chance to try any ray tracing yet, but it wouldn’t surprise me at all if you’re running into one or more gotchas with your CUDA kernel(s). How many registers do your kernels use, and what occupancy are you achieving? You may not be getting as much global/texture memory latency hiding as you need. Are you managing to use shared memory to reduce or eliminate memory accesses?
I have been working with both ATI and CUDA. My code on ATI is faster than my unoptimized code on CUDA; the reason is that my program has a lot of memory dependence.
Here is what I did to improve:
1. Use the texture cache. If your data structures are read-only, just bind them to a texture (in my case, a linear texture). It just works: fast and easy.
2. Try to use more shared memory. It is easy to do with the application I’m working on, but I don’t think you can easily use shared memory in a ray tracer.
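For item 1, the binding boils down to just a few calls. A minimal sketch, assuming the classic texture-reference runtime API and a read-only `float4` triangle buffer (the names here are made up for illustration):

```cuda
#include <cuda_runtime.h>

// Texture reference for read-only triangle data (hypothetical name).
texture<float4, 1, cudaReadModeElementType> triTex;

__global__ void fetchKernel(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(triTex, i);  // cached read through the texture unit
}

void bindAndLaunch(float4 *d_tris, float4 *d_out, int n)
{
    // Bind the plain linear device allocation to the texture reference;
    // no cudaArray is needed for tex1Dfetch.
    cudaBindTexture(0, triTex, d_tris, n * sizeof(float4));
    fetchKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(triTex);
}
```

Reads through `tex1Dfetch` go through the texture cache, so repeated accesses to nearby triangles avoid hitting global memory every time.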
If I have time this summer, I would like to play around with a ray tracer in CUDA too.
Is there a way to gauge register pressure in CUDA?
My acceleration hierarchies are allocated in linear memory and bound to a 1D texture at compute time. From reading posts here, my understanding is that linear memory may not play nicely with the texture cache. Unfortunately, the hierarchy is too big to fit into a 1D array, and 2D arrays pose problems because pointers either cost twice as much or must be generated at runtime with division and modulus operations.
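The runtime division/modulus addressing mentioned above would look something like this; a rough sketch with a hypothetical 2D node texture, where `W` is the row width of the layout:

```cuda
#include <cuda_runtime.h>

// Hypothetical 2D texture holding kd-tree nodes packed row by row.
texture<float4, 2, cudaReadModeElementType> nodeTex;

__device__ float4 fetchNode2D(int idx, int W)
{
    int y = idx / W;       // integer division is comparatively expensive on the device
    int x = idx - y * W;   // cheaper than a separate % once y is known
    // +0.5f centers the sample on the texel when using unnormalized coordinates.
    return tex2D(nodeTex, x + 0.5f, y + 0.5f);
}
```

If `W` is a power of two, the division and modulus reduce to a shift and a mask, which removes most of the addressing overhead.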
Unfortunately, I haven’t found a use yet for shared memory other than for generating random numbers.
90M rays/sec is pretty good, but this is only for the Cornell box scene. Depending on the scene, I see between 5 million rays/sec (on the Stanford XYZ dragon with 7M triangles) and about 35 million rays/sec on smaller scenes like the Stanford Bunny.
I’m using a simple kd-tree like in Wald’s thesis, and I use 1D linear textures for lots of stuff: triangles, texture coordinates, the kd-tree itself, etc…
I also HAVE found a use for shared memory, which made for quite a speedup… if you store your ray origins and directions in shared memory, you can index into them without funky tricks.
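Staging the rays in shared memory might look like this; a rough sketch with made-up names and a fixed block size:

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // assumed launch configuration for this sketch

__global__ void traceKernel(const float3 *origins, const float3 *dirs,
                            float3 *out, int n)
{
    // One ray per thread, staged in shared memory. Indexing by threadIdx.x
    // is a real indexed access, unlike arrays in registers/local memory.
    __shared__ float3 sOrig[BLOCK_SIZE];
    __shared__ float3 sDir[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    if (i >= n) return;

    sOrig[tid] = origins[i];
    sDir[tid]  = dirs[i];

    // ... traversal/intersection code reads sOrig[tid] / sDir[tid] as needed ...

    out[i] = sDir[tid];  // placeholder use of the staged data
}
```

The win is that the traversal loop can address the ray components freely without the compiler spilling them to (slow) local memory.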
Well, I have had experience with ray tracing, and I really doubt that it will EVER be in video games. It just takes too long. Maybe cutscenes, but not in-game…
For me, on my PC (XPS 600, dual 7800 GTX), ray tracing can take about 30 seconds per frame at 800×600, no AA, in Cinema 4D. I don’t know if that program is efficient, but it’s OK for doing work… or screwing around :D
What kinds of scenes are you rendering? I can render the Stanford Dragon, which has 7 million triangles, a couple of times a second at 800x600. This is only dot product lighting, but still… Note that many game levels have tens of thousands of triangles. I am seeing 30 frames/second ray tracing scenes like this, though I must admit there is no fancy stuff going on… no shadow rays, no antialiasing, no reflections. Which makes it pretty lame compared to OpenGL graphics :)
I agree that ray tracing folks have several problems to solve before ray tracing could be considered real-time for gaming purposes; however, I believe it’ll happen in the next year or two. Still, it’ll be a while before it is heavily used in games. My prediction :)
Not the least of which is finding a compelling application of ray tracing to a gaming setting in the first place. In my opinion, tracing eye rays or reflection/refraction rays isn’t it, especially considering the well-established rasterization-based alternatives.
How are you traversing the kd-tree? Did you implement the restart algorithm, or maybe a short stack? If you implemented a stack, how easy was it to get going? The paper linked above mentions great difficulties getting the short-stack variant to compile, so they resorted to fixing up the generated native assembly under CTM.
So far I’ve implemented a bounding volume hierarchy with static traversal order, kd-restart, and kd-shortstack. As it is, the BVH is the fastest, then kd-restart, then kd-shortstack. For me, register pressure seems to be the bottleneck, not memory latency.
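The restart variant boils down to a loop like the following; a rough sketch after Foley and Sugerman’s kd-restart, where `isLeaf`, `axis`, `splitPos`, `axisComp`, `order`, and `intersectLeaf` are hypothetical node-access helpers, not real API:

```cuda
// kd-restart traversal sketch: no stack; after finishing a leaf without a
// hit, advance the ray segment past it and restart from the root.
__device__ bool kdRestart(float3 o, float3 d, float sceneTMin, float sceneTMax)
{
    float tmin = sceneTMin, tmax = sceneTMax;
    while (tmin < tmax) {
        int node = 0;                        // restart from the root
        float tNear = tmin, tFar = tmax;
        while (!isLeaf(node)) {
            int ax = axis(node);
            float tSplit = (splitPos(node) - axisComp(o, ax)) / axisComp(d, ax);
            int near, far;
            order(node, d, &near, &far);     // pick near/far child by ray direction
            if (tSplit >= tFar || tSplit < 0.0f)
                node = near;                 // only the near child is hit
            else if (tSplit <= tNear)
                node = far;                  // only the far child is hit
            else {
                node = near;                 // both hit: clip to the near side;
                tFar = tSplit;               // the far side is reached after restart
            }
        }
        if (intersectLeaf(node, o, d, tNear, tFar))
            return true;                     // hit within the current segment
        tmin = tFar;                         // step past this leaf and restart
    }
    return false;
}
```

Every restart redoes work from the root, which trades extra node fetches for needing no per-thread stack storage; that trade is presumably part of why register pressure, not memory latency, shows up as the bottleneck here.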