Forward ray tracing in OptiX

Hi, apologies if this has been asked before. I’m reading through the OptiX documentation and trying to work out whether it’s possible to produce a forward ray tracing simulation with OptiX. We are attempting to model light propagation within and around plants, so rendering the scene is not the issue. What I’d like to do is model light rays from a source, e.g. the sun, and then measure where they strike the model, if at all. Rays striking the model would then recursively travel elsewhere.

One other requirement is that we’d need to be able to accumulate some value on the model triangles representing how often they had been hit. I’ve read that rays have a payload, but is it possible to adjust some “payload” attached to the model when a ray hits?

Many thanks,


Nothing in OptiX limits it to backward ray tracing. You can certainly do forward ray tracing in it.

For accumulation, you could come up with some novel payload to store the accumulation on a per-ray basis, but this seems needlessly complicated. Why not define an output buffer to store the accumulated values with one entry per triangle, and update it in your closest hit program?

Hi Michael, as ‘nljones’ said, you can use an output buffer with one entry per triangle. However, you will need to use CUDA atomic updates on the output buffer in case two rays hit the same triangle simultaneously. The update can become the bottleneck of such a program, depending on how much contention you have, but we have some tricks to make it faster. Please post back on here or email optix-help if you go down this path and need to optimize it.

We’ve been thinking of making an example showing forward ray tracing and accumulation since it comes up now and then.

Hi both, thanks for the quick response. A separate buffer sounds like the best idea, I’ll start with that. I have various projects on the go which will slow development, but will see how I get on. I will post again once I get to this possible bottleneck issue.

I’m not really sure how much contention we’ll get; it obviously depends on the ray count, but in a sense the whole point is to use as many rays as possible, to minimise noise in the simulation.

I gather from the documentation that it’s common to set up a number of threads comparable to the number of pixels in the output image? Given I have no image, I assume it’s a safe bet to batch rays into groups of roughly this size? This may be a question I can answer for myself once I get started. Fewer rays in flight at once would obviously decrease contention, but then you don’t want to be returning to the CPU all the time?

You probably don’t have to set the batch size too carefully except maybe for interactive preview (launch, preview, launch, …), where smaller batches would give you more frequent updates.

It would be easiest to start with primary rays first, then add recursive bounces to the ray generation program later. See the optixPathTracer example in the 4.0 SDK for one way to do backward recursive path tracing.

Please keep us posted.

I too am working on a forward ray-tracing simulator based on OptiX 5.0 and CUDA 9.0 on 64-bit Ubuntu with a GTX 980 Ti GPU. I am at the point of having all threads write into an output buffer to record times and amplitudes of beam arrivals at a receiver, so I am very interested to hear in advance what can be done to avoid update bottlenecks, as I imagine that there will be a lot of contention for memory writes, with many rays arriving at nearly the same time.

Was there ever anything published about the “tricks” needed to avoid/minimise this?



Hi David. We haven’t published any official guidance on this. I would proceed roughly like this:

  • Start without any protection against thread contention for the output buffer(s). This will probably give you the wrong answer, but it’s a performance baseline.

  • Change to CUDA atomic writes using atomicAdd() or similar. Confirm that the output is correct. Measure the performance drop from the baseline. Also note that atomicAdd with integers is often faster than with floats due to internal optimizations in OptiX and/or the driver. In your case, if you can store your ‘time’ buffer in integer milliseconds, it could be faster than floats. I would test this yourself, though.

  • If performance isn’t good enough, and you have memory to burn, you can reduce contention by duplicating your output buffers, or interleaving a single buffer, and having different threads write to different locations. Then you need a post-process launch that does the final N:1 buffer reduction, probably using a raw CUDA kernel. There are published ways to optimize these kinds of reductions in CUDA using, e.g., cooperative thread groups or warp-wide instructions.

  • If buffer duplication isn’t feasible, and you’re stuck, then please contact us over email.