Porting my renderer from C++ to CUDA: my journey

I have completed my implementation of a real-time SAH/BVH raytracer with CUDA, and blogged about the journey of moving the code from C++ to CUDA.

The resulting raytracer runs in real-time (around 10-20 times faster than the OpenMP/C++ version), and has the following features:

    Real-time raytracing of triangle meshes - my $70 GT240 renders a 67K-triangle chessboard with Phong lighting, Phong normal interpolation, reflections and shadows at 15-20 frames per second. Interactive navigation and rendering-mode changes are supported (see the video at my page, linked above).

    A Bounding Volume Hierarchy of axis-aligned bounding boxes is built and used for ray/triangle intersections. The BVH is created via the surface-area heuristic, and is stored for fast re-use. If SSE is detected during compilation, a SIMD implementation is used that builds the BVH faster.

    Compute capability 1.2 cards like my GT240 have no support for recursion, so I used C++ template metaprogramming to implement compile-time recursion - see cudarenderer.cu in the source tarball for details.

    C++ template-based configuration allows for no-penalty runtime selection of (a) specular lighting, (b) Phong interpolation of normals, (c) backface culling (e.g. not used in refractions), (d) reflections, (e) shadows, (f) anti-aliasing.

    A Z-order (Morton) curve is used to cast the primary rays - significantly less warp divergence, and therefore more speed.

    Vertices, triangles and BVH data are stored in textures - major speed boost.

    Screen and keyboard handling is done via libSDL, for portability (runs fine under Windows, Linux, etc.).

    The code is GPL, and uses autoconf/automake for easy builds under Linux. For Windows, the required MSVC project files are included, so the build is just as easy (see instructions at my page, linked above).

Enjoy!

Thanassis Tsiodras, Dr.-Ing.

Thank you for making this open source!

The related Reddit thread and the comments within are really worth a read: http://www.reddit.com/r/programming/comments/euxzx/porting_my_renderer_from_c_to_cuda_the_speed/

I might be looking into raytracing for radio propagation modeling in the future, so I hope I can learn some things from your implementation.

Actually, you are pointing to the old Reddit thread, about the “bins” implementation - the new one, opened yesterday, is here.

And as for releasing the code openly - my pleasure.

Thanassis.

The good news: It built fine on Ubuntu 9.04 with CUDA 2.3 toolkit.

The bad news: The chess scene only renders at 3 FPS using a Compute 1.1 card (32 shaders)

(II) Feb 07 11:30:58 NVIDIA(0): NVIDIA GPU Quadro FX 580 (G96GL) at PCI:1:0:0 (GPU-0)
(--) Feb 07 11:30:58 NVIDIA(0): Memory: 524288 kBytes

It seems that the switch to Compute 1.2 cards like the GT240 can provide a significant boost. It could be that the smaller register file in Compute 1.1 leads to register spills to local memory, or that the improved memory controller logic in 1.2 devices accounts for the better performance.

Haven’t figured that out yet - I will try turning on the verbose output in PTXAS to see what’s going on.

Christian

Try:

  1. Using the latest CUDA toolkit (3.2) - the new compiler may be able to do better optimizations.

  2. Not sure about this, but try also “-arch sm_11”: Patch configure.ac …

-NVCCFLAGS="-O2 -use_fast_math --compiler-options -fno-inline --compiler-options -fpermissive"

+NVCCFLAGS="-O2 -use_fast_math --compiler-options -fno-inline --compiler-options -fpermissive -arch sm_11"

…and then, “autoreconf && automake && ./configure --with-cuda=/path/to/your/cuda-3.2/ && make clean && make”

Thanassis.

This is very neat! Thanks for posting the code.

On my GTX 470, this renders the rotating chessboard at 45 fps, compared to 21 fps on one half of a GTX 295.

Edit: I should note that the GTX 470 is not the display card, so that 45 fps is even with buffer copies between cards for the OpenGL display.

… and someone happened to stroll by my open workstation with a GTS 450 in their hand. That renders the chessboard with default settings at 19 fps. (Again, the GTS 450 is not the display card.)

Any chance that someone could roll this into a benchmark?

It already has a benchmark mode - from the README:

"Since it reports frame rate at the end, you can use this as a benchmark

for CUDA cards. Just spawn with “-b” to request benchmarking:

./src/cudaRenderer -b 150 3D-objects/chessboard.tri

This will draw 150 frames and report speed back. With my GT240, it reports:

Rendering 150 frames in 8.117 seconds. (18.4797 fps)"

Thanassis.

Updated, version 2.1f: Bugfix, the first triangle was never rendered.

Available at my site.

I get (0.094968 fps) on a core 2 duo laptop running on Ocelot’s PTX to x86 JIT, rendering the chessboard. I wonder how this would compare to the OpenMP version?

“Rendering 15 frames in 157.948 seconds. (0.094968 fps)”

Also, do you mind if I use your code as a benchmark for research into compiler optimizations? What would be the best way to cite your implementation?

EDIT: That result was for -O0 -g. The result for -O3 optimization is:

“Rendering 20 frames in 59.228 seconds. (0.337678 fps)”

Benchmark done with the 2.1g release on nVidia GTS 250 (1GB memory)
Rendering 150 frames in 12.702 seconds. (11.8092 fps)

Part of the reason I published it under the GPL is to make sure that it can be easily used (and extended) for academic research. By all means, use it, Gregory.

P.S. Use this for citation:

“A real-time raytracer of triangle meshes in CUDA”, Thanassis Tsiodras, Dr.-Ing, Feb 2011.

<http://users.softlab.ntua.gr/~ttsiod/cudarenderer-BVH.html>.

P.P.S: To compare with OpenMP, just download the SW-only version - it uses OpenMP or TBB to utilize multi-core CPUs.

On GTX 580 (cuda sdk 3.2, windows 7 32 bit, no buffer copies):

Stock clocks (memory @ 4008 MHz, shader @ 1544 Mhz):
Rendering 150 frames in 2.037 seconds. (73.6377 fps)

Overclocked (memory @ 4600MHz, shaders @ 1800 MHz):
Rendering 150 frames in 1.807 seconds. (83.0105 fps)

Thanks, much appreciated.

Your raytracer is very good - congratulations! But I am getting the following error when trying to compile it (when I run make) on Linux:

Utility.cpp: In function ‘void panic(const char*, …)’:
Utility.cpp:34: warning: format not a string literal and no format arguments
CXX cudaRenderer-BVH.o
CXXLD cudaRenderer
/usr/bin/ld: skipping incompatible /opt/cuda/lib/libcudart.so when searching for -lcudart
/usr/bin/ld: cannot find -lcudart
collect2: ld returned 1 exit status
make[2]: *** [cudaRenderer] Error 1
make[2]: Leaving directory '/opt/renderer/cuda-renderer/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory '/opt/renderer/cuda-renderer/src'
make: *** [all-recursive] Error 1

I’m using CUDA Toolkit 3.2 x32, Ubuntu 10.04 x64 and a GTX 580; when I try to compile with CUDA Toolkit 3.2 x64, I get the same error.

Can you give me a hand?

I wish I could help - but I only have access to 32bit Linux environments.

The problem is clearly manifesting because of 64bit: the message “skipping incompatible” means that the linker found a 32bit cudart library, but couldn’t use it.

If you can’t use a 32bit building environment, you may be able to cope by adding “-m32” to the compiler/linker flags (to specifically request generation of a 32-bit binary).
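Concretely, that might look like the following change to the NVCCFLAGS line quoted earlier in the thread (an untested assumption on my part - nvcc's -m32 forces 32-bit code generation, and the host linker needs -m32 too):

```shell
NVCCFLAGS="-O2 -use_fast_math -m32 --compiler-options -fno-inline --compiler-options -fpermissive"
LDFLAGS="$LDFLAGS -m32"
```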

Good luck!

I tested using the 64-bit toolkit, but that did not work either…

Where do I put the flag you just mentioned - in the Makefile? Or do you recommend I use a 32-bit Linux?

--------------- EDIT!

Just changed the directory where ./configure looks for the libs from /lib to /lib64 and it worked!

thanks for the help!

Very nice work!

I ran the win32 app on a GT 425M (MSI FX700 notebook); the results are:
everything on, except antialiasing: 3.4fps
with antialiasing: 1.8fps

Soon I will get GTX 560 cards, and I will test with them as well.

Best Regards,

Gaszton

P.S.: I was playing with your program and got this:

The nvidia OpenGL driver lost connection with the display driver and is unable to continue.
…error code: 8