Concurrent access and growable buffer


I am doing a study for a client to see if OptiX can be used for its scientific needs.

We defined a simple use case that should represent something useful for him. Most of the things are quite trivial to develop but there are 2 things that seems tricky to implement:

The first thing that the client want is to extract some rays’ data (original direction, final direction, total distance, etc…) that hit a particular surface or go through a defined volume. The tricky part is that the number of rays that will hit this surface is not known since it depends on the scene thus we only have a higher bound. Also we may compute several reflected rays for every ray that hit a reflective surface. I fear that if we allocate up front a buffer big enough to store every possible ray we will not have enough memory on the GPU. So, would it possible to have something like an array that we can fill progressively and stop all computations when it reach a certain amount which would avoid data race ?

The second need of the client is very similar to the first. He wants to store some data of rays that hit a particular surface but by merging the incoming ray by small square surfaces. So for example, we will have a rectangle and each time a ray hit this rectangle we add the ray’s “power” to the corresponding “pixel” using UV coordinate of the rectangle. This time, to solve data race, it seems CUDA’ s atomic operation might solve this problem to some extends. Yet, it seems that if we want to store more than one scalar (for example, storing ray’s distance along its power) we still have data race since atomic work only on scalar. So, does it exist something similar to atomic functions but that can update larger data structure than 32 or 64 bit words?

At the moment I don’t know where to look at to solve these problems, so I would appreciate if someone can give me possible solutions and pointers to the documentation to get started.


Bonjour camille.viot,

These two ideas have multiple right answers, and they shouldn’t be too difficult at all, once you have a view on how to structure your application. I hope I can help. There are many, many people writing programs that implement exactly these ideas.

First, you do have control over allocating buffers that you can write to while inside your OptiX ray-gen and closest-hit programs, so you can easily create a buffer (array) that you fill progressively. If you want to do this dynamically, you certainly could use atomics to fill only the valid entries as you find them. But I recommend instead of using atomics, first try to do your reduction pass as a separate launch, if you can. So, first launch a batch of rays with an output buffer that has one entry per ray, and second, after that launch completes, launch a new program that collects the valid outputs, computes the reduction you need, and saves it to a second buffer.

To limit your data, you can break your computation into multiple launches. This is the default mode people use for progressive rendering; for example, to get 100 samples per pixel, you might trace 4 samples per pixel in one launch, and then the next launch accumulates the results into the output buffer. Then repeat those two launches 25 times until you’ve collected a total of 100 samples. In OptiX, you can see an example of this in the SDK sample called “optixPathTracer”.

To get the best performance, you will want to keep your batches as large as possible without running out of memory.

For your second issue, I’ve already mentioned one approach to consider: use separate launches, one for tracing rays, and another one for reduction. One way to think about the reduction is you can set the launch size & output size for your reduction to the size of the reduced batch of outputs, and then for each reduced output you want, collect the samples you need from the first launch. This is sometimes called a “gather” pass. If all reduction threads are about the same size, and if it’s not difficult to figure out the indexing you need, then doing it this way will allow you to avoid using atomics at all, and achieve higher performance.

Aside from separate launches with a gather pass, there are CUDA atomics you can use for updating more than 64 bit words within a single warp (as long as the data reduction can be contained within a single warp). Have a look at the warp-vote and warp-match functions in the CUDA programming guide ( These can be used to nominate a single thread within a warp to do whatever work you want at some syncronized point in your code. You will be able to create a code block that only one thread runs while the other threads in the warp wait. Obviously, if you do that you will want to absolutely minimize how much work you do inside that block.

I hope that helps, let me know if I wasn’t clear enough.


Note that these warp wide programming functions or access to shared memory is not allowed inside OptiX device code.
Means these methods require native CUDA kernels running in your application on the resulting data.

Hi David,

I really appreciate the effort you put in the answer. Unfortunately, I still fail to see how to use gather and reduction without atomics in my case.

For example, in the case where we want to fill a texture by merging rays that hit a quad, I don’t see how to merge them without synchronization. In the path tracer example it is trivial since we know, when creating a ray, where the ray contribution will be in the output texture because rays are shoot “from” the output texture. In my case, rays are shoot from a source and might hit a quad anywhere (including not hitting it at all). So, is the only solution available to not use atomics is to invert the process (ie, shoot ray from the texture to the source instead instead of shooting ray from the source)?

Also, when using reduction to only extracts some rays’ data, I can only imagine a case where I wouldn’t have to use atomics but I wonder if it wouldn’t be slower. From what I understand you recommended to first shoot some rays and write them to an output buffer.

So basically we end up with something like


where X represents ray’s data that hit the target and . represents default/uninitiated memory because the corresponding ray missed the target. This is possible to compute this without atomics but in this case we won’t know the number of Xs.

Then, the best that can be done without atomics is to filter the array to get something like


(and for each value of the array we want to “sort” we have to read the entire buffer but we can do it in parallel) and then read the array in host code to extract only the first valid Xs. But since we still don’t know the number of valid rays, we have to read each entry one by one and stop until we found the fist invalid data. It seems that there is quite a bit of overhead in contrast with the case where we use atomics. So, will it be faster this way?

EDIT: Also, when launching my current simulation with a somewhat big buffer (5000000 elements made of a custom struct of 2 float) and a launch context of (3000x3000) I got the following error:

OptiX Error: 'Unknown error (Details: Function "_rtContextLaunch2D" caught exception: Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (719): Launch failed)'

But when lowering the output buffer’s size or the launch’s context size, the error disappears. My wild guess, is that with those settings the program tries to allocate too much GPU memory. Is there a way to make sure of that?

I probably don’t understand your setup & question enough to make specific recommendations yet. My suggestion to use reductions over separate launches if you can is just a generic idea, and it isn’t possible in all cases. Even if it can work, it is not always easy. I imagined an output texture attached to some kind of camera or receiver. Maybe I got the wrong impression. So…

What exactly is the output texture you need? Are you saying that you have some kind of light or general ray emitter that is not a camera, and a non-flat mesh of quads in space, where each quad is a texel in the texture you’re trying to compute? You’d like to capture information at each quad in space that may correspond to multiple rays? I also didn’t understand the role of reflection rays in the original question, should they accumulate in the texture just like primary rays, or do you want to do something different with reflections?

Right now the two descriptions you’ve made – if I understand correctly – to me they sound suspiciously close to “Density Estimation” (by Bruce Walter) and/or “Photon Mapping” (by Henrik Wann Jensen). Those techniques are probably more expensive than accumulation with atomics would be, but can offer higher quality results. These techniques would involve several separate launches - one to fire a batch of rays, one to collect the result and build a hierarchical data structure, and one to reconstruct the output texture using lookups in the hierarchical data structure. In terms of algorithmic complexity, this is probably more or less equivalent to doing a sort, much the same way you outlined it. (But in terms of developer effort, they are much more difficult than a simple array sort.) If you don’t need any filtering, then you are right, using atomics to accumulate into each texel will be faster.

Maybe something similar to light baking is what you’re looking for?

Thanks again for taking time to answer me.

That is almost think, I guess. I will try to word it differently. I have a general ray emitter that is not a camera and a mesh made of triangles, not necessarily flat but with proper UV mapping (I will call it the receiver mesh). Then, my ray emitter (just a point in 3D space) shoots rays in every direction and I would like to generate a “texture” (with a single channel) for the receiver mesh, representing the contributions of the rays that hit it. There are also other meshes in the scene but they are only used to do reflection. And reflected should not be handled differently than primaries. They just update the ray’s payload (updating the origin, direction and intensity). Now, I wonder how to do that? I succeeded in 2 different ways:

The first solution:

1-Default initialize a buffer big enough to store every ray’s result.
2-If a ray hit the receiver, store its contribution in the buffer using it launch index.
3-Restart the context with a different program with the launch dimension of the texture.
4-In this new program iterate over the entire buffer and sum up every contribution whose UV match the launch index then write the result to the texture.

I also found a way with atomics:

1-Shoots rays.
2-If a ray hit the receiver, get the UV’s coordinates of the hit location.
3-Use atomicAdd to add the ray’s contribution to the existing contribution in the texture.

Over theses solutions (and maybe others I didn’t think of), what is the one you would recommend generally?

Also, just to be clear, in the 2nd case, it seems like I need atomic because if I do something like this

texture[uv] = texture[uv] + ray_contribution;

I might have data races since this same line can be executed by multiples “kernels” at the same time with the same UV index.

Also, if the texture has multiples channels or is a buffer containing a user defined struct, would be harder to update? From my naive point of view if the channels (or fields) are correlated they need to be retrieved and updated all at once otherwise there might be data race too.

Knowing the best way to do this would solve 50% of my questions.

Then, I have a different problem but in the exact same context (still a general emitter, a receiver and other meshes used for reflections): I would like to save some data about each ray that hit the target but without reducing it. To put it another way, every single ray’s data (origin, total traveled distance, starting orientation, ending orientation, etc…) that hit the target should be saved. Now, in my scene, 80% of the rays (included the reflected ones) never hit the target. So if I create a big array that can hold every ray’s data (let’s forget memory limitation here) I would have to filter out a big part of it. Now, I wonder what is the best way to retrieve only the meaningful data. I have though of 3 ways:

1- Just write to the output buffer using the ray launch’s index and filter data on the CPU.
2- Write to a intermediate buffer using the ray launch’s index and filter data on the GPU using a second ray generation program (so basically doing the same thing on the CPU but on the GPU).
3- Use a global counter accessed and updated through atomic functions to store meaningful ray’s sequentially starting at the beginning of the output buffer. That way, the I know the number of ray’s that hit the target and the data is stored sequentially.

Then again, what you would recommend here?

Thanks for the links, I will read the papers in depth and surely learn quite a bit in the process.

It seems like you have several potentially very good solutions for both problems! You are correct, you do need atomics in your second solution. For the first one, you do have the option to use a fixed launch size and launch multiple times, right? No need to necessarily guarantee your launch buffer is big enough to hold all the ray data; you could slice it into several passes if you have more rays than memory?

The best way depends pretty heavily on very specific details of your setup. If you have both methods implemented, my best suggestion would be to scale up a bit and do some timings to find which is faster. I know that’s not as helpful as you would like, but it is very easy to cause huge changes in performance due to your payload size or memory traffic in your closest hit program or other architectural details of your program, so we don’t know what’s best without profiling.

For using atomics to update a larger structure, if several atomic instructions in a row gets slow, you could use CUDA interop for your reduction; save the OptiX results to a launch-sized buffer, and then do a CUDA launch where you can do the filtering/reduction using the CUDA intrinsics for block level locking. This set of slides has a pretty good overview of multiple common techniques:

I would first maybe try making a separate lock for every texel in your output texture, rather than a single lock for access to the texture. I expect using locks to write to a struct or locking on a larger block of code to be somewhat slow. “If” you can figure out clever ways to do your indexing to relieve contention, or break your work into several passes that don’t need per-thread locking, you might be better off. But, some algorithms do require some contention and are difficult to parallelize.

For your last 3 options, I would expect the first one (filtering on the CPU) to be much slower than keeping data on the GPU. Which is better of options 2 and 3 depend on workload. Number 3 sounds easiest to try. Will option 2 require a global counter/atomic anyway during the GPU filter pass, or will it not require an atomic to filter the data? If option 2 can be implemented without an atomic, I would speculate that this one could be fastest.