Porting my renderer from C++ to CUDA - the speed gains and their cost.

I spent last weekend porting my software-only renderer to CUDA. I blogged about my results here, where I also share my code under the GPL. Note that this is not about “doing graphics with CUDA instead of OpenGL”: it was just an exercise in porting some algorithms (that happen to be graphics algorithms) to CUDA, and the post describes my experience in doing so.

Feel free to comment on my efforts:

  • any positive feedback and/or algorithmic suggestions most welcome.
  • please refrain from bashing, I tried to be objective :-)

Kind regards,
Thanassis Tsiodras, Dr.-Ing.

Wow, thank you for posting this. I think you would have received more responses in the general CUDA or programming forum, as what you describe is not really Linux-specific.

Note that when you ignore the shadow flag at the end, the CUDA compiler’s dead code elimination kicks in and throws out all computations related to that variable. That reduces the kernel’s register count and may make it much faster (no more spills to local memory, for example).

I once did a triangle rasterizer with the tiled approach in shared memory as well; this was to implement the “Evolisa” algorithm in CUDA. I never finished the genetic algorithm though - all I had to show was hundreds of transparent triangles rendered at 600 fps ;-)

I put the line coefficients describing the triangle bounds into constant memory for best speed. For more than a given number of triangles I would have needed multipass due to constant memory size restrictions.
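
For readers wondering what "line coefficients describing the triangle bounds" can look like, here is a minimal host-side C++ sketch. It is not the code from that rasterizer: the names `EdgeEq`, `makeEdge`, `makeTriangle` and `inside` are hypothetical, and it assumes counter-clockwise triangles with the convention that `a*x + b*y + c >= 0` on the interior side of each edge. In a CUDA version, an array of such coefficients is the kind of data that would sit in `__constant__` memory.

```cpp
#include <array>

// One triangle edge as a line equation a*x + b*y + c.
// Assumed convention: the value is >= 0 on the interior side
// of a counter-clockwise triangle.
struct EdgeEq { float a, b, c; };

// Line through (x0,y0) -> (x1,y1); positive to the left of that direction.
EdgeEq makeEdge(float x0, float y0, float x1, float y1) {
    return { y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
}

// Three edge equations bounding a counter-clockwise triangle.
std::array<EdgeEq, 3> makeTriangle(float x0, float y0,
                                   float x1, float y1,
                                   float x2, float y2) {
    return {{ makeEdge(x0, y0, x1, y1),
              makeEdge(x1, y1, x2, y2),
              makeEdge(x2, y2, x0, y0) }};
}

// A pixel is covered when it lies on the interior side of all three edges.
bool inside(const std::array<EdgeEq, 3>& tri, float x, float y) {
    for (const EdgeEq& e : tri) {
        if (e.a * x + e.b * y + e.c < 0.0f) return false;
    }
    return true;
}
```

With this layout each triangle costs nine floats, and the per-pixel test is three multiply-adds and comparisons, which is why the number of triangles per pass is bounded by how many coefficient sets fit in constant memory.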

You are probably right about the dead code elimination. In any case, I’ve decided to rewrite the algorithm from scratch, implementing a Bounding Volume Hierarchy acceleration structure (instead of the “binning” I am using in the article). In all the papers I’ve read, it appears that this structure is the optimal one for CUDA-based raytracing.

We’ll see!

I have completed the implementation, and blogged about the journey of moving the code from C++ to CUDA.

The resulting raytracer runs in real-time (around 10-20 times faster than the OpenMP/C++ version), and has the following features:

    Real-time raytracing of triangle meshes - my $70 GT240 renders a 67K-triangle chessboard with Phong lighting, Phong normal interpolation, reflections and shadows at 15-20 frames per second. Interactive navigation and rendering mode changes are allowed (see video at my page, linked above).

    A Bounding Volume Hierarchy using axis-aligned bounding boxes is created and used for ray/triangle intersections. The BVH is created via the surface-area heuristic, and is stored for fast re-use. If SSE support is detected during compilation, a SIMD implementation is used that builds the BVH faster.

    Cards of CUDA compute capability 1.2, like my GT240, have no support for recursion, so I used C++ template magic to implement compile-time recursion - see cudarenderer.cu in the source tarball for details.

    C++ template-based configuration allows for no-penalty runtime selection of (a) specular lighting (b) Phong interpolation of normals (c) backface culling (e.g. not used in refractions) (d) reflections (e) shadows (f) anti-aliasing.

    Z-order curve is used to cast the primary rays (Morton order) - significantly less divergence => more speed.

    Vertices, triangles and BVH data are stored in textures - major speed boost.

    Screen and keyboard handling is done via libSDL, for portability (runs fine under Windows/Linux/etc).

    The code is GPL, and uses autoconf/automake for easy builds under Linux. For Windows, the required MSVC project files are included, so the build is just as easy (see instructions at my page, linked above).
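
The compile-time recursion item in the feature list can be illustrated with a small plain C++ sketch. This is not the code from cudarenderer.cu: `Tracer` and the toy "shading" arithmetic are hypothetical, chosen only to show how a depth template parameter plus a terminating specialization turns a recursive call into a fixed chain of distinct functions.

```cpp
// The remaining bounce count is a template parameter.
// Tracer<N>::trace calls Tracer<N-1>::trace, which is a different
// function, so nothing ever calls itself at run time.
template <int Depth>
struct Tracer {
    static float trace(float energy) {
        float local = energy * 0.5f;   // toy stand-in for local shading
        float reflected = Tracer<Depth - 1>::trace(energy * 0.5f);
        return local + reflected;
    }
};

// Specialization that ends the chain: local shading only, no reflection.
template <>
struct Tracer<0> {
    static float trace(float energy) {
        return energy * 0.5f;
    }
};
```

A call such as `Tracer<2>::trace(1.0f)` expands into three nested, non-recursive functions. On the GPU side the same pattern would presumably be applied to `__device__` functions, with the maximum depth fixed at compile time.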

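One way to get the "no-penalty runtime selection" from the feature list is to make each option a `bool` template parameter and pick the matching instantiation once, outside the hot loop. The sketch below is an assumption about the general technique, not the renderer's actual code; `shade`, `ShadeFn` and `selectShader` are hypothetical names, and only two of the six options are modelled.

```cpp
// Each feature is a compile-time flag. Inside one instantiation the
// conditions are constants, so the compiler can drop disabled branches.
template <bool UseSpecular, bool UseShadows>
float shade(float diffuse, float specular, float shadowFactor) {
    float color = diffuse;
    if (UseSpecular) color += specular;
    if (UseShadows)  color *= shadowFactor;
    return color;
}

using ShadeFn = float (*)(float, float, float);

// The runtime decision happens once, here, not per pixel.
ShadeFn selectShader(bool specular, bool shadows) {
    if (specular && shadows) return &shade<true, true>;
    if (specular)            return &shade<true, false>;
    if (shadows)             return &shade<false, true>;
    return &shade<false, false>;
}
```

In a CUDA program the same switch would more likely choose which kernel instantiation to launch rather than return a function pointer, but the idea is the same: the per-pixel code contains no runtime checks of the flags.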

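The Z-order (Morton order) mapping mentioned in the feature list can be sketched as follows: a linear ray index is split into its even and odd bits, which become the pixel's x and y. The helper names `compactEvenBits`, `Pixel` and `mortonToPixel` are hypothetical, and the sketch assumes coordinates that fit in 16 bits each.

```cpp
#include <cstdint>

// Gather bits 0, 2, 4, ... of v into the low 16 bits of the result.
inline std::uint32_t compactEvenBits(std::uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

struct Pixel { std::uint32_t x, y; };

// Even bits of the index form x, odd bits form y.
inline Pixel mortonToPixel(std::uint32_t index) {
    return { compactEvenBits(index), compactEvenBits(index >> 1) };
}
```

With this mapping, 32 consecutive indices starting at a multiple of 32 cover an 8x4 block of pixels instead of a 32-pixel row, so the rays handled together tend to hit nearby geometry and follow similar paths through the BVH.
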
Thanassis Tsiodras, Dr.-Ing.