Is it sane to use acceleration structures with CUDA-based raycasting?

Hi all,

I have been reading a bit about GPU-accelerated raycasting on the CUDA architecture, specifically this recent thesis (http://ivokabel.wz.cz/pages/myWorks/Ivo_Pavlik_-_Thesis.pdf) and probably the first paper on doing this with CUDA (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4634648). I was wondering why the performance results never show a comparison to a naive implementation (one without any acceleration structure), because in my experience, using any acceleration structure on GPUs makes performance worse, and to get the performance benefits you need to tweak your kernels for optimum performance. One more thing: looking at the performance results, the maximum frame rates are around 20-25 fps for the approach given in the Pavlik thesis, whereas the basic volume renderer shipped with the CUDA SDK gives around 30 fps unoptimized and, with a little bit of optimization, can easily give around 60 fps.
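For readers unfamiliar with what the "naive implementation" means here, below is a minimal CPU sketch in the spirit of the SDK's volume renderer sample: fixed-step front-to-back compositing through every voxel along the ray, with no acceleration structure at all. All names and constants are mine, not from the SDK; the real kernel would sample a 3D texture with trilinear filtering.

```cpp
#include <array>
#include <cstddef>

constexpr int N = 32;                       // volume side length (hypothetical)
using Volume = std::array<float, N * N * N>;

// Nearest-neighbour lookup; the GPU version would use a 3D texture fetch.
float sampleVolume(const Volume& vol, int x, int y, int z) {
    return vol[(static_cast<size_t>(z) * N + y) * N + x];
}

// March a ray along +z through voxel column (x, y), compositing
// front to back. One sample per voxel: O(N) work per ray, always.
float marchRay(const Volume& vol, int x, int y, float opacityPerStep = 0.05f) {
    float color = 0.0f, transmittance = 1.0f;
    for (int z = 0; z < N; ++z) {
        float density = sampleVolume(vol, x, y, z);
        color += transmittance * density * opacityPerStep;
        transmittance *= (1.0f - density * opacityPerStep);
    }
    return color;
}
```

Launching one such loop per pixel is essentially the whole renderer, which is why it maps so well onto the GPU's data-parallel model.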

Does anyone else feel the same?

Regards,

Mobeen

Speed-up techniques, as a rule, impose some overhead, so for some scenarios brute force may outperform a more sophisticated solution. Scalability is commonly the main reason to pay that overhead and obtain a more robust solution across a broader spectrum of practically relevant cases. There are many different kinds of "scalability"; the most common for volume rendering is the performance dependency on the size of the volumetric data. Brute-force methods have cubic time complexity, while the best adaptive volume renderers may have logarithmic time complexity. If the overhead of obtaining that logarithmic time complexity is significant, then there is some size threshold below which brute force may outperform the adaptive VR technique.
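The threshold argument can be made concrete with a toy cost model. The numbers below are purely illustrative (my own, not measured): the brute-force marcher pays a constant per sample and takes n samples per ray, while the adaptive traversal pays a fixed per-ray overhead plus a logarithmic term.

```cpp
#include <cmath>

// Toy per-ray cost model, n = volume side length in voxels.
// Coefficients are illustrative assumptions, not benchmarks.
double bruteForceCost(int n) { return 1.0 * n; }                  // O(n)
double adaptiveCost(int n)   { return 50.0 + 4.0 * std::log2(n); } // O(log n) + overhead

// Smallest power-of-two volume side where the adaptive scheme wins.
int crossover() {
    for (int n = 2; n < (1 << 20); n *= 2)
        if (adaptiveCost(n) < bruteForceCost(n)) return n;
    return -1;
}
```

With these particular coefficients the adaptive scheme only starts winning at n = 128; below that, its fixed overhead dominates and brute force is faster, which is exactly the threshold effect described above.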

Do you really see worse performance with acceleration structures? Because for models with >10k triangles I cannot see a naive implementation being better than a solution with an acceleration structure.

The original question was about the performance of volumetric ray-casting on GPU/CUDA (a more general term: high-quality volume rendering). You may read the thesis from the original post: http://ivokabel.wz.cz/pages/myWorks/Ivo_Pavlik_-_Thesis.pdf

Regarding ">10k triangles": more than 100 million triangles would be required to mimic volumetric ray-casting output similar to http://upload.wikimedia.org/wikipedia/commons/b/b5/Croc.5.3.10.a_gb1.jpg

As long as the GPU remains a SIMD machine (a few of them packed together), it is not well suited to an efficient implementation of adaptive rendering techniques, since the code path (instruction path) of any adaptive rendering technique is driven by the local context of the data around each ray. It is therefore a task-parallelism problem, not a data-parallelism problem, so it is no surprise that the best multi-threaded CPU-based volumetric ray caster (running on dual X5650) dramatically outperforms ANY known (up-to-date) GPU implementation of a volumetric ray caster. Please note that it is not impossible to implement adaptive ray-casting VR on a GPU; it is just problematic to make the implementation efficient. Performance per transistor or per watt is doomed to be lower than what a multi-core i7 (an efficient MIMD machine) can provide.
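A rough model of why data-dependent traversal hurts on SIMD: a warp executes in lock-step, so in the worst case (every lane taking a distinct branch of the traversal) the warp's runtime is the sum of all the code paths its lanes take, while independent MIMD cores only wait for the slowest lane. This sketch is a deliberate simplification with illustrative numbers; real warps re-converge and partially share paths.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// laneCosts[i] = cost of the code path ray i happens to need.

// Worst-case SIMD warp: lanes diverge onto distinct paths, and the
// lock-step warp serializes over all of them.
double warpCost(const std::vector<double>& laneCosts) {
    return std::accumulate(laneCosts.begin(), laneCosts.end(), 0.0);
}

// MIMD cores run the lanes independently; the group finishes in the
// time of the slowest lane.
double mimdCost(const std::vector<double>& laneCosts) {
    return *std::max_element(laneCosts.begin(), laneCosts.end());
}
```

With four fully divergent lanes costing 1, 2, 3 and 4 units, the warp model pays 10 units while the MIMD model pays 4, which is the per-transistor efficiency gap the paragraph above is pointing at.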

However, you may obtain excellent VR performance on a GPU via texture mapping (TM-VR); in fact, brute-force TM-VR runs roughly 100x faster on a GPU (for hardware in a similar price range). The problem with TM-VR is that it has cubic time complexity, so scalability with data size and quality super-sampling is an issue. There are a few publications on merging adaptive techniques with TM (applying an adaptive density per 3D texture brick), and several commercial VR engines seem to use such a hybrid technique (in my opinion).
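The brick-based hybrid idea can be sketched as follows: partition the volume into bricks, precompute which bricks contain no data, and let the marcher skip an empty brick in one step instead of sampling every voxel in it. Everything below (names, brick size, the per-column simplification) is a hypothetical illustration of the technique, not any particular engine's implementation.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int N = 32, B = 8;              // volume side, brick side (illustrative)
using Volume = std::array<float, N * N * N>;

// One emptiness flag per brick along z for a given (x, y) column;
// a real engine would build a full 3D brick grid once, up front.
std::vector<bool> emptyBricks(const Volume& vol, int x, int y) {
    std::vector<bool> empty(N / B, true);
    for (int z = 0; z < N; ++z)
        if (vol[(static_cast<size_t>(z) * N + y) * N + x] > 0.0f)
            empty[z / B] = false;
    return empty;
}

// Samples actually taken by the marcher: a full brick costs B samples,
// an empty brick is skipped for the cost of one step.
int samplesWithSkipping(const Volume& vol, int x, int y) {
    int samples = 0;
    for (bool e : emptyBricks(vol, x, y)) samples += e ? 1 : B;
    return samples;
}
```

On a mostly empty column this takes a handful of steps where the pure brute-force marcher always takes N, while staying close enough to plain texture-mapped marching to keep the GPU's data-parallel efficiency.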

Stefan

Well, the volume renderer sample was mentioned as an example, as far as I read the original question :) I have no clue about volume rendering ;)