optixLaunch configuration revisited

Dear OptiX team,

The OptiX optixLaunch() function provides launches with a configurable width, height, and depth, which can then be conveniently queried in the ray generation shader.

In a previous question (Launch dimensions in LaunchContextnD and optixLaunch), @dhart mentioned that OptiX uses this information to arrange the computation for 2D launches in tiles, presumably to exploit coherence in primary rays.

I am wondering: what are the best practices for an OptiX kernel that computes multiple samples per pixel? How do I get the best mapping from launch indices to CUDA cores in this case? For instance, should I set the launch depth to the sample count? Or is it better to put the sample count in the ‘width’ position of the OptiX launch, since it is perhaps the most important dimension for capturing coherence?

I am also curious to know if any of the best practices change when the ray generation program contains a full path tracer loop that does a sequence of ray tracing calls (these tend to become incoherent after the first 1-2 bounces).

Currently, we just do 1D launches, and this of course works fine, but I am wondering if we are leaving performance on the table. (The last time I tried to use the launch configuration in a more deliberate way, I actually saw some small performance regressions, though I may have misused the feature.)

Thank you,
Wenzel

Hi @wenzel.jakob, nice to see you!

These are good questions. It depends a bit on what you need, but I’ll try to clarify several scenarios.

For multiple samples, do you want/need each sample’s result to end up in a separate pixel? Or can you average the result of multiple samples in a single thread at the end of raygen before writing the result?

Your question suggests you may want the former. Our SDK sample optixPathTracer does the latter. If you can avoid writing individual samples and instead perform the reduction in a single thread, then we would generally expect this to be faster than storing the results for each sample separately, regardless of the thread-pixel mapping. This assumes that each pixel takes the same number of samples, to minimize thread divergence.

We tile the 2d launch into warp-sized tiles in OptiX because there is usually some performance benefit from doing so for primary rays, and it almost never hurts. And yes, the benefit comes from ray coherence: usually memory loads and cache hit rates for BVH, geometry, shaders, and textures are all higher for rays that are very similar compared to rays that go different directions, just because the rays tend to traverse similar spaces and hit similar locations in the scene. For multi-bounce path tracing, especially for diffuse materials, this tiling scheme doesn’t provide much benefit, and I don’t have any clever tricks or good advice to offer in that case: the speedup is mostly determined by the camera rays, and the amount of speedup you get may be limited by your average path depth.

With a 1d launch, we don’t do any tiling, but you can do this tiling yourself, and verify whether you see some benefit. If you need to store each sample separately, then whatever your mapping is, try to arrange it so that consecutive thread ids correspond to very similar rays. This would mean putting (for example) jittered primary ray samples consecutively.

With an OptiX 2d launch, if you use width*height*numSamples number of threads, then you would probably want to arrange your samples into blocks or tiles so all samples for a pixel are in a tile, which happens automatically if you render an enlarged image. For example, if you render a 1080p image with 16 samples per pixel, you could use a 2d launch of (4*1920, 4*1080) == (7680, 4320). This way you can get the tiling benefit without any change to your indexing at all. The only thing you’d need to do is run a separate reduction kernel on the 4x4 image tiles.

We don’t tile the 3d launch. With a 3d launch, given a launch size specified by (width, height, depth), by default the launch is depth-major, followed by row-major, so the X (or width) coordinate is the inner-most index. This means you’d want to map your multiple camera ray samples to the X / width parameter in order to have coherent rays grouped into warps.

So, my advice is to not map samples to individual threads, if you can (understanding that this might not be possible in your case if your postprocessing reduction is more sophisticated than a simple average and/or if pixels need data from neighboring pixels). Not mapping samples to threads means using width*height threads for your launch and looping over samples in raygen. The next easiest fallback is to render a larger image with a 2d launch that you downsample/reduce with a post-kernel, but this comes with the downside of needing a constant, rectangular number of sub-samples. Third would be a 1d or 3d launch where you pay attention to the indexing and put neighboring samples into neighboring threads.

Does that help? Let me know if any of it is unclear or reveals more questions.


David.

We don’t tile the 3d launch. With a 3d launch, given a launch size specified by (width, height, depth), by default the launch is depth-major,

Note that this detail actually allows 3D launches of width * height * samples per pixel dimensions without writing the individual results into separate 3D launch indices and a post-process for the accumulation.
Instead you can accumulate each sample per pixel on the fly into a 2D width * height output buffer using atomics, and these are not going to slow things down much, because the 3D launch indices are handled as 2D slices, which means there will never be any congestion of the atomics across the z-dimension at reasonable sizes.

Note that the maximum launch dimension in OptiX is 2^30.

I’ve written a tile-based renderer which launched all samples per pixel for each tile (with a launch dimension of around 1M launch indices), with separate output locations and a native CUDA accumulation kernel, and the result was NOT faster than rendering full 2D images for each sample with accumulation inside the ray generation program, which means the scheduler in OptiX works well.
Also, trying to optimize with this information essentially relies on implementation-dependent behavior.

For interactive workloads I would not recommend doing that. Long running kernels under Windows WDDM are usually bad. That’s something better suited for compute-only devices. Doing less work more often, which in this case for example could mean one 2D launch per sample or smaller tiles, would result in better interactivity.
OptiX launches are asynchronous and the launch overhead is a few microseconds. This can also fill the CUDA stream with enough work to make things non-interactive but would prevent Windows WDDM issues.

Thanks so much @dhart and @droettger, this is very useful information.

Just for clarity, when we are referring to 1D or 2D launches, these are optixLaunch commands where the trailing dimensions are 1? (IIRC there used to be launch calls of different dimensions in pre-OptiX 7 times, but these don’t exist anymore.)

Now regarding what we do, I suspect it is probably crazy/different from typical OptiX usage:

  • we render N monte carlo samples per pixel (where N might be relatively big, say, 1024). This is on linux where WDDM isn’t an issue, and the 2^30 launch limit is definitely in sight (it’s no problem to do multiple passes of course).

  • each pixel sample is accumulated into multiple pixels based on a pixel reconstruction filter (gaussian, mitchell, etc.). This is the same image reconstruction approach also taken by PBRTv3 or Mitsuba on the CPU (in fact, the OptiX version of our renderer is generated by a JIT compiler based on the existing CPU rendering code).

This does a lot of atomic operations: for a 4x4 reconstruction kernel, and RGB, alpha, weight output channels, we have 4x4x5 = 80 atomic scatter-adds per sample!

Millions of threads hammering global memory using atomic memory operations: what could possibly go wrong? It’s incredibly impressive that this is actually quite performant on NVIDIA hardware (switching to a box filter only makes a small difference, usually < 5% of the total render time).

The reconstruction filter is actually a quite critical aspect for us, because we are differentiating the rendering process to run gradient-based optimization algorithms. A box reconstruction filter would not be differentiable in its position argument and therefore produce incorrect results.

A few more follow ups:

  • I would be curious if you have feedback on the atomic sample splatting – is what we do reasonable given the requirement of using a non-box pixel reconstruction filter?

  • What is the tile size used by OptiX for 2D launches? 4x4 tiles for a 16-samples-per-pixel image were mentioned above, but the SIMD width of CUDA ALUs is 32, correct?

  • Right now, we use a 1D launch where samples within a pixel are next to each other (fastest), then rows, then columns of the image. However, if I interpret your suggestions above, it sounds like a 2D launch of shape (sqrt(samples_per_pixel) * width, sqrt(samples_per_pixel) * height, 1) might reap some additional benefits from ray coherence. Did I understand this correctly?

Thanks again!

when we are referring to 1D or 2D launches, these are optixLaunch commands where the trailing dimensions are 1

Correct, 1D launches are (width, 1, 1) and 2D launches are (width, height, 1).

I would be curious if you have feedback on the atomic sample splatting – is what we do reasonable given the requirement of using a non-box pixel reconstruction filter?

There is not much to do about that when there is no 1-to-1 relationship between launch indices and result cells, that is, when not using gather algorithms.
Scattering algorithms require atomics because there is no information about neighboring launch indices available in a single launch with OptiX’ single ray programming model. That information would only be available between launches and you could do whatever you want with the current results in native CUDA kernels or as input to other OptiX launches.

What is the tile size used by OptiX for 2D launches?

The warp-sized blocks are 32x1 in 1D and usually 8x4 in 2D and 3D.
You can see them as blocky corruption inside the image when you forget to initialize some per ray payload ;-)
Or when visualizing the clocks taken for each launch index. Some examples have a “time view” feature which looks like this:
https://developer.nvidia.com/blog/profiling-dxr-shaders-with-timer-instrumentation/

Right now, we use a 1D launch where samples within a pixel are next to each other (fastest), then rows, then columns of the image. However, if I interpret your suggestions above, it sounds like a 2D launch of shape (sqrt(samples_per_pixel) * width, sqrt(samples_per_pixel) * height, 1) might reap some additional benefits from ray coherence. Did I understand this correctly?

1D launches are not tiled. They are simply run in 32x1 blocks for each warp, and if the same pixel gets handled by a full warp, that would be spatially optimal.

The other idea David described above, making the launch dimension a super-resolution of the image, would result in a similar spatial ordering of the SPP launch indices to warps due to the 8x4 blocks, as long as that covers pixels evenly, but that wouldn’t be as perfect for all samples-per-pixel counts.

The 3D launch would revisit some of the 2D launch indices multiple times so that would also not be spatially perfect for the launch indices to warp assignments. Accumulating the SPP into a 2D image from a 3D launch is just a method to reduce the number of launches. It would actually be slower when iterating in z-dimension per launch index because that would be the worst memory access pattern.

If you use a scattered write algorithm anyway, I think your 1D launch is just fine.
These ideas were more along the lines of having gather algorithms and a native CUDA post-processing kernel in which you could use all available CUDA features like shared memory to speed up the accumulation.

You could always experiment with a different spatial assignment of these 1D launch indices to 2D pixels in your image, simulating any block layout you want or space filling curves. It would be interesting to see if that would affect the congestion of the atomics. My gut feeling is that the 1D launch has less congestion than a 2D launch if you need to fill in square kernels around each pixel but that would need to be benchmarked.

So a scatter with atomics really changes everything. ;) So far I’ve personally avoided atomics so much that my experience with them is limited.

Avoiding atomic contention requires making sure adjacent threads executing at the same time don’t all try to lock the same resource at the same time, while the coherence benefit of tiling requires making sure adjacent threads in a warp are all trying to read from the same memory at the same time. There’s a built-in conflict here, and I’m guessing that having a write atomic per sample might outweigh the benefits of tiling coherence. It seems entirely possible you could get the best performance out of an incoherent ray workload, by avoiding any tiling. @droettger’s point about using a 3d launch with 2d indexing is good: less restrictive on the number of samples than a large super-sampled image, though I would guess that with an atomic write per sample you might want to do the opposite of what I suggested above and order your indices to ensure rays are far apart.

If you were to separate your passes into a render of sample data followed by a gather, how much data would you need to store per sample? You could avoid atomics this way, though I would guess this would not be faster than atomics. Possibly much slower if the data is large, but maybe it’s not out of the question to consider if your sample data is very small.


David.

Thank you @dhart and @droettger for these helpful suggestions!

Sharing in case it is interesting to you: limited experimentation thus far indicates that global memory atomics on the Turing architecture are fast! Simply removing all of the atomics and writing out the raw per-sample information (image space position, RGB value) ends up being slightly slower (~1%) compared to merging those samples into a single image buffer using global atomics issued from within the raygen program.

(I am speculating here, but perhaps this is due to the significantly larger number of writes that need to pass through L2 and go to global memory, whereas the cache can be more effective when writing to a comparably small output image?)

Note that this does not even include the cost of the additional kernel that would be needed to perform a separate sample accumulation, so the slowdown would likely be greater in practice.

Altogether, I am quite surprised by the reasonable performance of the naïve approach we are currently using, which seems to violate all common sense regarding atomic memory operations. I’ve read somewhere that newer NVIDIA GPUs have an ALU within the L2 cache that is used to merge contending updates, which might have some role to play here…?

I also tried to see if reordering the wavefront indices to avoid conflicts would help (for example, to render rows, then columns, then individual samples instead of samples, then columns, then rows). This produced a roughly 5% slowdown, so the coherence on the first bounce seems to be beneficial and outweigh the cost of large numbers (4x4x5 == 80) of contending global memory atomics at the end of the ray generation program.

Details on setup: I am using a Titan RTX, rendering the ‘staircase’ scene from Benedikt Bitterli’s scene repository at 720p, 128 samples/pixel, 9 bounces.

One unrelated observation is that the official API documentation of optixLaunch is a little terse. Documenting how the launch configuration affects ray coherence (& batching into tiles) might be useful to other users of this function.


Yes this is super interesting, thanks for sharing @wenzel.jakob! These results indeed are better than what I assumed. Speaking personally, there does seem to be a pattern of having naïve and brute force solutions surprise me with better performance on the GPU than all the clever tricks I’ve learned for CPU renderers over the years. It’s great that you see a perf benefit with the coherent ray tiling in the presence of your atomic scatter!

I will take the feedback on documentation to our team, and maybe we can add something about the tiling; it’s a good suggestion to let people know the tiling is there, and it might also help people who want to do their own tiling in 1D and 3D launches.


David.