Is there any performance difference between implementing a ray-tracer in CUDA vs. in rendering pipelines?

(bear with me for a long introduction)

A big part of my interest in GPU computing is in photon Monte Carlo (MC) simulations. We have an NIH-funded software project called mcx, and we have been optimizing it and adding new features (in the past, we also got a lot of valuable help from folks in this forum, thank you!).

The core of the MC simulation is essentially a ray-tracer - a photon is first launched from a location, goes through a series of scattering and absorption events, and then exits the domain somewhere; then we start the next one, until all photons are simulated. CUDA/GPU allows this to be done very efficiently using a massive number of threads.

However, there are some major differences between our “ray-tracer” and the typical ray-tracing in rendering tasks, namely

  1. each ray (a photon packet) typically experiences many (hundreds of) scattering events before exiting - using the terminology I heard from graphics talks, it is mostly performing “sub-surface scattering” (but it can certainly handle less scattering or transparent media - in that case it behaves just like a typical graphics ray-tracer).

  2. the optical properties are typically associated with volumetric elements (voxels or tetrahedral elements), whereas in graphics rendering tasks the optical materials are associated with surfaces (triangles).

  3. we have a very efficient acceleration structure - the photon is either bounded within a voxel (we only need to test intersections with the 6 facets) or bounded by a known tetrahedron (we only need to test ray-triangle intersections with 4 triangles) for each photon movement.

  4. we need to save volumetric data along photon trajectories - either in a voxel grid or in a tetrahedral mesh - to represent light intensity (fluence/fluence rate) in 3D space. In comparison, most graphics renderers only care about the RGB values in a 2D camera pixel space.
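As a rough illustration of points 3 and 4, here is a minimal Python sketch of one photon step inside a voxel grid: find the nearest of the 6 faces of the current voxel, then attenuate the packet weight over that segment. This is not the mcx implementation; the function names (`dist_to_voxel_face`, `attenuate`), the axis-aligned unit-voxel layout and the Beer-Lambert weight update are assumptions made for the example.

```python
import math

def dist_to_voxel_face(p, v, voxel_size=1.0):
    """Distance from p along unit direction v to the nearest face of the
    voxel containing p. Returns (distance, axis of the face that is hit)."""
    best, axis = math.inf, -1
    for i in range(3):
        if v[i] > 0.0:
            face = (math.floor(p[i] / voxel_size) + 1) * voxel_size
        elif v[i] < 0.0:
            # ceil-1 so that a photon sitting exactly on a face moves
            # into the neighbouring voxel instead of getting distance 0
            face = (math.ceil(p[i] / voxel_size) - 1) * voxel_size
        else:
            continue  # parallel to this pair of faces
        d = (face - p[i]) / v[i]
        if d < best:
            best, axis = d, i
    return best, axis

def attenuate(weight, mua, dist):
    """Beer-Lambert attenuation over one segment (hypothetical helper).
    Returns (new_weight, absorbed); 'absorbed' is what would be added
    to the volumetric output for the current voxel."""
    new_weight = weight * math.exp(-mua * dist)
    return new_weight, weight - new_weight

d, axis = dist_to_voxel_face((0.25, 0.5, 0.5), (1.0, 0.0, 0.0))
print(d, axis)  # 0.75 0
```

Only three divisions and comparisons are needed per step because the containing voxel is always known, which is why no tree traversal is involved.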

My codes were written in CUDA (and separately, OpenCL). These have been working quite well, and I can see speedups of hundreds to thousands of times compared to a single CPU thread, which I am quite happy with. But the recent buzz around RTX and the RT cores from NVIDIA caught my attention again, and I keep wondering if I can get a significant speed improvement by somehow porting my volumetric ray-tracing code to the new ray-tracing hardware.

However, from what I have read (which is extremely limited), the interface to the ray-tracing hardware seems to be limited to rendering APIs. Given the above major differences (different optical property attachment, different output format, different acceleration structure), I don’t really see a clear pathway to port my CUDA/OpenCL code to OpenGL, Vulkan or OptiX - again, my understanding of these frameworks is extremely basic.

So, my questions here for everyone are

  1. can my code (CUDA/OpenCL) directly benefit from the new ray-tracing hardware without major changes?

  2. if not, is there any way I can modify my CUDA/OpenCL code in order to use the new hardware and do the ray-tracing (but my kind of ray-tracing) more efficiently?

  3. if I cannot keep the CUDA/OpenCL framework in order to use the real-time ray-tracing functionality, then which of these models (OpenGL, Vulkan, OptiX) will likely give me sufficient flexibility to implement my MC-like ray-tracing?

  4. Is there a metric that I can measure, for example, ray-voxel or ray-triangle intersection tests per second, in order to get an idea of how my CUDA code is doing in comparison to those reported real-time ray-tracer benchmarks? I’ve heard of several mega-rays per second, but I don’t really know how many scattering/reflection events each ray goes through on average in those benchmarks.

Sorry for the long question, but I think some comments along this line will really help me understand how feasible it is to advance MC using the new hardware resources.

  1. no, probably not
  2. not at the current time
  3. OptiX is the recommendation from NVIDIA at this time for this type of work. Yes, there is graphics API support (DX, OGL, etc.) for ray-tracing using the RT cores, but that would be even further afield from the type of environment you’d find in OptiX, or what you have currently in CUDA.
  4. There is a rays-per-second metric associated with the RT cores, and the current crop of Turing RTX 20-series cards are in the range of 5-10 billion rays per second (GigaRays/s). A "ray" here is something like a ray until it hits something. The scattering you’re talking about would be modeled as new rays.

Thanks Robert. I found some OptiX examples (interop with CUDA) and will take a closer look. At a quick glance, it seems OptiX ray tracing is still a bit too heavyweight (or overkill) for my calculations; too bad that CUDA cannot use this new hardware directly.

I just bought an RTX 2080 and installed it this afternoon. I already had 2x Titan V in the box. Overall, the 2080 runs about 50% faster than the 1080Ti (the 1080Ti with the 377 driver was actually slightly faster), but 20% slower than the Titan V, which is what I had expected. I imagine the 2080Ti will be closer to the Titan V.

I don’t think this is directly comparable because the types of computation are quite different. I did a very rough test using a simple problem. The domain is a grid made of 60x60x60 voxels, filled with a background medium (the scattering coefficient mus is 1/mm, meaning that the mean free path is 1 mm - that is, on average, a photon changes direction every 1 mm along its trajectory). I launched 1e8 photons. The 2080 GPU reported a speed of 68k photons/ms, which is ~70 million photons per second. Each photon on average experiences ~366 scattering events before termination. So, if you call each segment along the trajectory a new ray, my code is currently doing about 25 billion “rays” per second.
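For reference, the arithmetic behind that estimate, written out as a short Python snippet (the variable names are made up for the example; the two input numbers are the ones quoted above):

```python
photons_per_ms = 68_000        # reported speed on the RTX 2080
scatters_per_photon = 366      # average scattering events per photon

photons_per_s = photons_per_ms * 1000
segments_per_s = photons_per_s * scatters_per_photon

print(photons_per_s)   # 68000000
print(segments_per_s)  # 24888000000
```

Note that this counts one "ray" per scattering segment; it says nothing about how many voxel faces each segment crosses, so it is only loosely comparable to a rays-per-second figure from a rendering benchmark.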

Of course, it does not do any of the RGB color calculations, but it does trace the photon’s intensity (luminance) and outputs a volumetric map of the luminance within the 60x60x60 voxels. If I turn on boundary reflection (Fresnel), the speed drops to 36k photons per ms.

Perhaps it was unwise of me to make that comment.

If I recall correctly from the last time I looked at your code, the ray directions, scattering events, and new ray directions after scattering are all computed based on random distributions. Is that correct?

If so it seems to me that calculation is nowhere near the complexity associated with BVH traversal to discover ray-object intersections, along with new ray generation in a geometrically and physically correct fashion.

I don’t really know how your problem would be mapped into RT cores. It might be a mismatch as you are suggesting.

In graphics ray tracing, using the RT cores, a reflected ray is considered a new ray from the point of view of throughput or performance calculations, as far as I know.

From looking at your code (and associated paper), here are some possible ideas I had for a rough outline to start porting the code to use RTX (using the Vulkan “VK_NV_ray_tracing” extension as an example):

  • Ray tracing, when initiated by the “vkCmdTraceRaysNV” command, will normally cast a ray for each pixel in the specified image buffer dimension (width * height * depth) given in its arguments. So define the image buffer dimensions to have as many pixels as the total number of photons to launch, ie. image width * image height = total no. of photons to launch, while image depth = 1.

  • Map voxels into sets of triangle vertices and use it to create the VK_NV acceleration structures, ie. each face of the voxel is defined by two triangles.

  • The ray’s “rayPayloadNV” payload will be used to store the values which the photon computes along its launched path: ie. absorption attenuation, etc.

  • Create a closest hit shader which stores its associated voxel’s values (absorption attenuation, etc.) into the ray payload. This closest hit shader will get called when a ray intersects a triangle on the face of the voxel. The shader can determine the triangle which was intersected by the ray and from that determine the voxel encountered by the photon in order to get access to the medium index for that voxel and thus get the medium properties needed to accumulate into the ray payload value.

  • Create a miss shader which stores a special value in the ray payload to indicate the photon has exited the domain.

  • Create a ray generation shader (this gets called for each ray cast per pixel in the image buffer) which, instead of computing a ray origin and direction based on a camera viewpoint and the image buffer, will compute the ray origin and direction based on code from your “launchnewphoton” function. The ray generation shader then calls “traceNV”, which will cast that single ray from the computed origin along the specified direction. Any intersected triangle will result in the ray’s payload containing the values placed into it by the closest hit shader, ie. values corresponding to the medium properties of the voxel whose face contains the intersected triangle. The ray generation shader will then use those values to perform the accumulated photon-medium interaction computations. Then perform the logic to determine if more photons need to be launched, and thus loop back to the “launchnewphoton” and “traceNV” code, or terminate the photon launching loop. After ending the photon launch loop, it can then write the resulting computed data into the image buffer, ie. each photon effectively has a vec4 of floats it can write into the image buffer as its computed data.

  • Then in the main program, after performing the ray tracing process, instead of using the populated image buffer to copy to the screen via “vkCmdCopyImage”, just read the vec4 floats directly from the image buffer to save as the result data.

So basically, we’re just using a VK_NV image buffer to launch a grid of rays equal in number to the total number of photons we want to launch, thus invoking the ray generation shader code for each of those rays to intersect voxels via the “traceNV” RTX call, and having the closest hit shader code pass medium information values back to the ray generation shader code for it to perform the photon-medium computations needed.
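The voxel-to-triangle mapping in this outline can be sketched on the host side as follows. This is a Python illustration rather than Vulkan code; `voxel_triangles`, `voxel_of_primitive` and the "12 consecutive triangles per voxel" ordering are assumptions made for the example, not part of VK_NV_ray_tracing. Triangle winding is not made consistent here, so face culling would have to be disabled.

```python
# Corner index = dx + 2*dy + 4*dz; each quad walks around one voxel face.
FACE_QUADS = (
    (0, 2, 6, 4),  # x = 0
    (1, 3, 7, 5),  # x = 1
    (0, 1, 5, 4),  # y = 0
    (2, 3, 7, 6),  # y = 1
    (0, 1, 3, 2),  # z = 0
    (4, 5, 7, 6),  # z = 1
)

def voxel_triangles(ix, iy, iz, size=1.0):
    """Return the 12 triangles (2 per face) bounding voxel (ix, iy, iz)."""
    corners = [((ix + dx) * size, (iy + dy) * size, (iz + dz) * size)
               for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)]
    tris = []
    for a, b, c, d in FACE_QUADS:
        tris.append((corners[a], corners[b], corners[c]))
        tris.append((corners[a], corners[c], corners[d]))
    return tris

def voxel_of_primitive(prim_id, nx, ny):
    """Recover the voxel index from a triangle index, assuming triangles
    were appended voxel by voxel in x-fastest order, 12 per voxel."""
    lin = prim_id // 12
    return lin % nx, (lin // nx) % ny, lin // (nx * ny)

print(len(voxel_triangles(0, 0, 0)))  # 12
```

One caveat of this naive layout: every interior face is emitted twice (once per adjacent voxel), so a ray crossing it meets two coincident triangles. Emitting each shared face once and storing both neighbouring voxel indices would be an alternative.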

I wonder though if using RTX would accelerate your code since effectively your code has every ray always intersect a triangle at a very small distance (ie. the width of a voxel) from the ray origin. Whereas usually RTX would cast a ray into what is generally empty space and use the acceleration structures to see if any triangle eventually gets intersected. With a port of your code, you’ve essentially made an extremely dense acceleration structure where an intersection is always guaranteed at a very small distance in any direction (unless the ray leaves the domain space).

thank you both for the very insightful replies.

First, let me reply to Robert’s comment. It wasn’t my intent to fully replicate the rendering ray-tracing pathway for my application and still hope that it would be accelerated. Rather, I was asking which aspects of ray-tracing in the rendering pipeline I can take advantage of (the new hardware, data structures etc.) specifically to accelerate my application and, more importantly, which aspects of the rendering ray-tracers (BVH traversal, surface optical properties etc.) I can remove and replace with my type of ray-tracing.

The above question is also directly related to the programming framework - which of the frameworks is fine-grained enough to allow me to remove the unnecessary steps and adapt it to my data/workflow (voxel-based or tetrahedral-mesh-based acceleration structure, volumetric optical properties, volumetric data storage etc.).

That is correct: the photon’s scattering zenith angle follows the Henyey-Greenstein phase function, the scattering azimuthal angle is uniformly distributed in [0, 2*pi), and the scattering length follows an exponential distribution.
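The three distributions just listed can be sampled with the standard inverse-CDF formulas. The sketch below is a generic Python illustration, not code taken from mcx; the function name `sample_scatter` and its parameters (anisotropy `g`, scattering coefficient `mus`) are chosen for the example.

```python
import math
import random

def sample_scatter(g, mus, rng):
    """Draw (cos_theta, phi, length) for one scattering event.
    g: Henyey-Greenstein anisotropy, mus: scattering coefficient (1/mm)."""
    xi = rng.random()
    if abs(g) < 1e-6:
        cos_theta = 2.0 * xi - 1.0              # isotropic limit
    else:
        t = (1.0 - g * g) / (1.0 - g + 2.0 * g * xi)
        cos_theta = (1.0 + g * g - t * t) / (2.0 * g)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against round-off
    phi = 2.0 * math.pi * rng.random()          # uniform in [0, 2*pi)
    # 1 - random() lies in (0, 1], so the log is always finite
    length = -math.log(1.0 - rng.random()) / mus
    return cos_theta, phi, length

rng = random.Random(1)
samples = [sample_scatter(0.9, 1.0, rng) for _ in range(20000)]
mean_cos = sum(s[0] for s in samples) / len(samples)
mean_len = sum(s[2] for s in samples) / len(samples)
print(mean_cos, mean_len)  # should be close to g and 1/mus respectively
```

The mean of cos(theta) under the Henyey-Greenstein function equals g, and the mean scattering length equals 1/mus, which gives a quick sanity check for any implementation.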

By the way, I also have a mesh-based Monte Carlo (MMC) algorithm that is much closer to a rendering pipeline/data structure than the voxel-based one (MCX). In MMC, a ray traverses a tetrahedral mesh; for each step, it runs only 1-4 (2.5 on average) ray-triangle intersection tests to determine the intersection and the next element, so this is a very efficient “acceleration structure (AS)” for ray casting. Incidentally, I recently found that there was a conference paper on graphics rendering using the same idea, i.e. using a tetrahedral mesh as the AS, published 2 years before my MMC paper.

I also found a working cuda code using this idea for rendering

But again, if CUDA cannot use the new hardware for this calculation, there is no benefit in porting my CUDA code to this CUDA code.
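To make the tetrahedral traversal step mentioned above concrete, here is a small Python sketch that finds which of the 4 faces a ray leaves through, given that the ray origin is inside the tetrahedron. It uses a plane-based exit test for simplicity, which is a stand-in and not necessarily the ray-triangle test used in MMC; the names `exit_face` and `FACES` are hypothetical.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

# Face i is the triangle opposite vertex i.
FACES = ((1, 2, 3), (0, 2, 3), (0, 1, 3), (0, 1, 2))

def exit_face(verts, p, v):
    """Return (distance, face index) where the ray p + t*v leaves the
    tetrahedron 'verts' (4 points); p is assumed to lie inside it."""
    best_t, best_face = math.inf, -1
    for i, (a, b, c) in enumerate(FACES):
        n = cross(sub(verts[b], verts[a]), sub(verts[c], verts[a]))
        if dot(n, sub(verts[i], verts[a])) > 0.0:
            n = (-n[0], -n[1], -n[2])   # make the normal point outward
        denom = dot(n, v)
        if denom <= 0.0:
            continue                     # moving away from this face
        t = dot(n, sub(verts[a], p)) / denom
        if t < best_t:
            best_t, best_face = t, i
    return best_t, best_face

tet = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
t, face = exit_face(tet, (0.1, 0.1, 0.1), (0.0, 0.0, -1.0))
print(face)  # 3, the face opposite vertex 3 (the z = 0 plane)
```

In a full traversal, the returned face index would be used to look up the neighbouring tetrahedron in a precomputed adjacency table, so no global search structure is needed.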

I want to thank cudapop1 for the detailed comments and suggestions. Very, very insightful! I had to read a bit more on the RTX ray-tracing workflow to understand the API calls, but I think I understand most of what you suggested (and am very much tempted to give it a try).

I do have a few conceptual questions

  1. Acceleration structure: your suggestion of converting a voxel into a triangular surface model, and then utilizing traceNV to do the ray-tracing, makes sense to me. From what I have read, the API to build the AS is accelerationStructureNVX(), but it looks like it still aims to partition triangles in space using a BVH, which in my case is less efficient compared to, say, if I can tell the ray tracer that only 12 triangles to do the hit-test every step, or even better, by-passing the AS at all, and directly call the “closest hit” shader to use my “hitgrid” function (a ray-voxel closest-hit test in CUDA) to find the intersections.

is this possible with the current API support?

  2. Handling scattering events: when a photon scatters, it changes the direction and scattering length (the max length of the ray, regardless of how many triangles it has hit). From my understanding of traceNV, this means I will need to terminate the current ray and go back to the ray-generation shader to create a new ray based on the desired distributions. If this is true, then I want to know: can I accumulate the total traveled length of a ray in rtx api? And, how can I terminate a ray if I decide it has reached the scattering length (before it intersects with the triangle)?

  3. Writing volumetric output: does the rayPayloadNV only handle a compact data structure (like vec4) associated with a ray? or it can “accumulate” such payload into a shared data location, say in the global memory? The ray payload is kind of the “state” of the photon, which in my implementation is currently handled by 3 vec4 (p,v,f), but the accumulation of the photon state variable into a common volume is something that I don’t see clearly in the ray-tracing workflow. Can this be done?

I very much appreciate your suggestions, and look forward to hearing more.

Here are some rough ideas on how I’d approach trying to tackle the items you posted:

... if I can tell the ray tracer that only 12 triangles to do the hit-test every step...

The function “traceNV” accepts the AS against which to cast a ray as its first argument, so you could try passing it a different AS containing just the triangles you want to test for intersection. I haven’t tried multiple AS though, but it might be worth testing to see if it’s useful for your use.

... by-passing the AS at all, and directly call the "closest hit" shader to use my "hitgrid" function...

My understanding of the RTX hardware is that the RT core silicon is designed to do BVH traversal and ray-triangle intersection. So if you use Vulkan’s feature to create your own custom shape intersection shader rather than using the built-in triangle intersection shader, I’m not sure if the Vulkan API will simply use the CUDA cores to execute your shader and not use the RT cores at all, and hence lose any possible RT acceleration benefits. Probably best to test it out in code to compare the execution speed of using the built-in triangle shape intersection shader vs. a custom shaped intersection shader.

... can I accumulate the total traveled length of a ray in rtx api? ...

You can accumulate the total traveled length as part of the ray’s payload (ie. as part of the parameter identified by “rayPayloadNV”), so that the closest-hit shader can access it for modification, and the ray generation shader can then read it to check the length travelled.

... how can I terminate a ray if I decide it has reached the scattering length (before it intersects with the triangle)?...

The 2nd to last parameter of the “traceNV” function is the “tmax” parameter, which specifies an upper bound for the distance in which intersections will be checked.
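The two suggestions above (accumulating the path length in the payload and using “tmax” to stop at the scattering site) can be combined into one loop. The Python sketch below only emulates the control flow; `trace` is a hypothetical 1D stand-in for a hardware trace call plus closest-hit/miss shaders, with unit voxels along x, and is not the traceNV API.

```python
import math

def trace(x, direction, tmax):
    """Stand-in for a trace call: distance to the next unit-voxel
    boundary along x, or None (a 'miss') if it is farther than tmax."""
    if direction > 0:
        boundary = math.floor(x) + 1
    else:
        boundary = math.ceil(x) - 1
    dist = abs(boundary - x)
    return dist if dist <= tmax else None

def advance_to_scatter(x, direction, scatter_len):
    """Move a photon until it has travelled scatter_len, crossing voxel
    boundaries on the way. The payload dict plays the role of the ray
    payload that the shaders would read and update."""
    payload = {"path_len": 0.0, "crossings": 0}
    remaining = scatter_len
    while remaining > 0.0:
        hit = trace(x, direction, remaining)   # tmax = remaining length
        if hit is None:
            # no boundary within tmax: scattering site is in this voxel
            x += direction * remaining
            payload["path_len"] += remaining
            remaining = 0.0
        else:
            x += direction * hit
            payload["path_len"] += hit
            payload["crossings"] += 1
            remaining -= hit
    return x, payload

print(advance_to_scatter(0.25, +1, 2.0))
# (2.25, {'path_len': 2.0, 'crossings': 2})
```

In this scheme a "miss" within tmax does not mean the photon left the domain; it means the scattering site was reached first, so leaving the domain would need a separate check (for example against the grid bounds).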

... or it can "accumulate" such payload into a shared data location, say in the global memory?

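Vulkan supports what are called uniform buffer objects (UBOs), which allow you to pass memory structures which are in global memory to shaders. Note that UBOs are read-only from the shader side; for data that the shaders need to write, Vulkan provides storage buffers (SSBOs) and storage images. So if you place your read-only inputs in a UBO and your output volume in a storage buffer, your closest-hit and ray generation shaders can access/update them as needed (since many rays may update the same voxel concurrently, the updates would need atomic operations).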
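As a sketch of what such a shared output volume looks like, here is a flat array indexed like a 3D grid, with deposits from several photons added into it. This is single-threaded Python, so no atomics are needed; on the GPU each `+=` would have to be an atomic add. The names `linear_index` and `accumulate` and the x-fastest layout are assumptions for the example.

```python
def linear_index(ix, iy, iz, nx, ny):
    """Flatten a 3D voxel index into a 1D buffer offset (x fastest)."""
    return ix + nx * (iy + ny * iz)

def accumulate(volume, deposits, nx, ny):
    """Add each (voxel index, energy) deposit into the shared volume.
    On a GPU, this += is where an atomic add would be required."""
    for (ix, iy, iz), energy in deposits:
        volume[linear_index(ix, iy, iz, nx, ny)] += energy

nx = ny = nz = 60
volume = [0.0] * (nx * ny * nz)
deposits = [((1, 2, 3), 0.5), ((1, 2, 3), 0.25), ((0, 0, 0), 1.0)]
accumulate(volume, deposits, nx, ny)
print(volume[linear_index(1, 2, 3, nx, ny)])  # 0.75
```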