Porting my renderer from C++ to CUDA: my journey

I have completed my implementation of a real-time SAH/BVH raytracer with CUDA, and blogged about the journey of moving the code from C++ to CUDA.

The resulting raytracer runs in real-time (around 10-20 times faster than the OpenMP/C++ version), and has the following features:

    Real-time raytracing of triangle meshes - my $70 GT240 renders a 67K-triangle chessboard with Phong lighting, Phong normal interpolation, reflections and shadows at 15-20 frames per second. Interactive navigation and rendering-mode changes are supported (see the video at my page, linked above).

    A Bounding Volume Hierarchy of axis-aligned bounding boxes is built and used for ray/triangle intersections. The BVH is created via the surface-area heuristic, and is stored for fast re-use. If SSE is detected during compilation, a SIMD implementation is used that builds the BVH faster.

    Compute capability 1.2 cards like my GT240 have no support for recursion, so I used C++ template metaprogramming to implement compile-time recursion - see cudarenderer.cu in the source tarball for details.

    C++ template-based configuration allows for no-penalty runtime selection of (a) specular lighting, (b) Phong interpolation of normals, (c) backface culling (e.g. not used in refractions), (d) reflections, (e) shadows, (f) anti-aliasing.

    A Z-order (Morton) curve is used to cast the primary rays - significantly less warp divergence, and therefore more speed.

    Vertices, triangles and BVH data are stored in textures - major speed boost.

    Screen and keyboard handling is done via libSDL, for portability (runs fine under Windows, Linux, etc.).

    The code is GPL, and uses autoconf/automake for easy builds under Linux. For Windows, the required MSVC project files are included, so the build is just as easy (see instructions at my page, linked above).

Enjoy!

Thanassis Tsiodras, Dr.-Ing.

Thank you for making this open source!

The related Reddit thread and the comments within are really worth a read: http://www.reddit.com/r/programming/comments/euxzx/porting_my_renderer_from_c_to_cuda_the_speed/

I might be looking into raytracing for radio propagation modeling in the future, so I hope I can learn some things from your implementation.

Actually, you are pointing to the old Reddit thread, about the “bins” implementation - the new one, opened yesterday, is here.

And as for releasing the code openly - my pleasure.

Thanassis.

The good news: It built fine on Ubuntu 9.04 with CUDA 2.3 toolkit.

The bad news: The chess scene only renders at 3 FPS using a Compute 1.1 card (32 shaders)

(II) Feb 07 11:30:58 NVIDIA(0): NVIDIA GPU Quadro FX 580 (G96GL) at PCI:1:0:0 (GPU-0)
(--) Feb 07 11:30:58 NVIDIA(0): Memory: 524288 kBytes

It seems that the switch to Compute 1.2 cards like the GT240 can provide a significant boost. It could be that the smaller register file in Compute 1.1 leads to register spills to local memory, or that the improved memory controller logic in 1.2 devices accounts for the better performance.

Haven’t figured that out yet - I will try turning on the verbose output in PTXAS to see what’s going on.

Christian

Try:

  1. Using the latest CUDA toolkit (3.2) - the new compiler may be able to do better optimizations.

  2. Not sure about this, but try also “-arch sm_11”: Patch configure.ac …

-NVCCFLAGS="-O2 -use_fast_math --compiler-options -fno-inline --compiler-options -fpermissive"

+NVCCFLAGS="-O2 -use_fast_math --compiler-options -fno-inline --compiler-options -fpermissive -arch sm_11"

…and then, “autoreconf && automake && ./configure --with-cuda=/path/to/your/cuda-3.2/ && make clean && make”

Thanassis.

This is very neat! Thanks for posting the code.

On my GTX 470, this renders the rotating chessboard at 45 fps, compared to 21 fps on one half of a GTX 295.

Edit: I should note that the GTX 470 is not the display card, so that 45 fps is even with buffer copies between cards for the OpenGL display.

… and someone happened to stroll by my open workstation with a GTS 450 in their hand. That renders the chessboard with default settings at 19 fps. (Again, the GTS 450 is not the display card.)

Any chance that someone could roll this into a benchmark?

It already has a benchmark mode - from the README:

"Since it reports frame rate at the end, you can use this as a benchmark

for CUDA cards. Just spawn with “-b” to request benchmarking:

./src/cudaRenderer -b 150 3D-objects/chessboard.tri

This will draw 150 frames and report speed back. With my GT240, it reports:

Rendering 150 frames in 8.117 seconds. (18.4797 fps)"

Thanassis.

Updated, version 2.1f: Bugfix, the first triangle was never rendered.

Available at my site.

I get (0.094968 fps) on a core 2 duo laptop running on Ocelot’s PTX to x86 JIT, rendering the chessboard. I wonder how this would compare to the OpenMP version?

“Rendering 15 frames in 157.948 seconds. (0.094968 fps)”

Also, do you mind if I use your code as a benchmark for research into compiler optimizations? What would be the best way to cite your implementation?

EDIT: That result was for -O0 -g. The result for -O3 optimization is:

“Rendering 20 frames in 59.228 seconds. (0.337678 fps)”

Benchmark done with the 2.1g release on nVidia GTS 250 (1GB memory)
Rendering 150 frames in 12.702 seconds. (11.8092 fps)

Part of the reason I published it under the GPL is to make sure that it can be easily used (and extended) for academic research. By all means, use it, Gregory.

P.S. Use this for citation:

“A real-time raytracer of triangle meshes in CUDA”, Thanassis Tsiodras, Dr.-Ing, Feb 2011.

<http://users.softlab.ntua.gr/~ttsiod/cudarenderer-BVH.html>.

P.P.S: To compare with OpenMP, just download the SW-only version - it uses OpenMP or TBB to utilize multi-core CPUs.

On GTX 580 (cuda sdk 3.2, windows 7 32 bit, no buffer copies):

Stock clocks (memory @ 4008 MHz, shader @ 1544 Mhz):
Rendering 150 frames in 2.037 seconds. (73.6377 fps)

Overclocked (memory @ 4600MHz, shaders @ 1800 MHz):
Rendering 150 frames in 1.807 seconds. (83.0105 fps)

Thanks, much appreciated.

Your raytracer is very good - congratulations! But I am getting the following error when trying to compile it (when I run make) on Linux:

Utility.cpp: In function ‘void panic(const char*, …)’:
Utility.cpp:34: warning: format not a string literal and no format arguments
CXX cudaRenderer-BVH.o
CXXLD cudaRenderer
/usr/bin/ld: skipping incompatible /opt/cuda/lib/libcudart.so when searching for -lcudart
/usr/bin/ld: cannot find -lcudart
collect2: ld returned 1 exit status
make[2]: *** [cudaRenderer] Error 1
make[2]: Leaving directory '/opt/renderer/cuda-renderer/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory '/opt/renderer/cuda-renderer/src'
make: *** [all-recursive] Error 1

I’m using CUDA Toolkit 3.2 x32, Ubuntu 10.04 x64 and a GTX 580; when I try to compile with CUDA Toolkit 3.2 x64, I get the same error.

Can you give me a hand?

I wish I could help - but I only have access to 32bit Linux environments.

The problem is clearly manifesting because of 64bit: the message “skipping incompatible” means that the linker found a 32bit cudart library, but couldn’t use it.

If you can’t use a 32bit building environment, you may be able to cope by adding “-m32” to the compiler/linker flags (to specifically request generation of a 32-bit binary).
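Concretely, that might look like the following change to the NVCCFLAGS line quoted earlier in the thread (an untested assumption on my part - nvcc's -m32 forces 32-bit code generation, and the host linker needs -m32 too):

```shell
NVCCFLAGS="-O2 -use_fast_math -m32 --compiler-options -fno-inline --compiler-options -fpermissive"
LDFLAGS="$LDFLAGS -m32"
```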

Good luck!

I tested using the 64-bit toolkit, but that did not work either…

Where do I put the flag you just mentioned - in the Makefile? Or do you recommend I use a 32-bit Linux?

--------------- EDIT!

Just changed the directory where ./configure looks for the libs from /lib to /lib64 and it worked!

thanks for the help!

Very nice work!

I ran the win32 app on a GT 425M (MSI FX700 notebook); the results are:
everything on, except antialiasing: 3.4fps
with antialiasing: 1.8fps

Soon I will get GTX 560 cards, and I will test with them as well.

Best Regards,

Gaszton

P.S.: I was playing with your program and got this:

The nvidia OpenGL driver lost connection with the display driver and is unable to continue.
…error code: 8