Alternative algorithms for accelerating occlusion traces

I built a simulation for interior light and sound that uses a progressive Monte Carlo algorithm. I used the optixPathTracer example as a starting point, so my algorithm still has the same basic structure as the one in the example.

The light calculation uses only a single occlusion trace (direct sunlight) in the __closesthit__radiance program and combines the radiance value from it with the radiance value from the __miss__radiance program (indirect lighting).

For the sound calculation, which happens in the same pipeline (it only writes into a different buffer), I loop through all the sound sources and trace occlusion for them. This however is really inefficient, since I have 200+ sound sources and basically do occlusion traces for all of them every time the __closesthit__radiance program runs. I already tried limiting the number of traces by tracing only sources within a certain distance, but the gains were only minor.

All in all it still computes quite fast. The problem is that I'm using the simulation data as part of the training environment for a reinforcement learning algorithm. That means performance is really critical, since the simulation can easily double the training time. So I was wondering if there are other algorithmic approaches to this problem. I'd be glad for any suggestions! Thanks in advance!

The first thing would be to try optimizing the occlusion rays.

The OptiX 7.4.0 optixPathTracer only handles opaque materials (no cutout opacity), so an occlusion ray would not require a closesthit or anyhit program, only a hardcoded miss program; the example code uses a closesthit program instead.
Compare the optixPathTracer occlusion implementation with this code:
https://forums.developer.nvidia.com/t/anyhit-program-as-shadow-ray-with-optix-7-2/181312/2
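
For reference, here is a minimal sketch of that pattern, assuming fully opaque geometry. The isVisible() helper name is illustrative; the RAY_TYPE_* enums follow the optixPathTracer naming:

```
extern "C" __global__ void __miss__occlusion()
{
    optixSetPayload_0( 1u ); // The ray reached the source unblocked: visible.
}

// Device-side helper to be called from __closesthit__radiance:
__forceinline__ __device__ bool isVisible( OptixTraversableHandle handle,
                                           float3 origin, float3 direction, float tmax )
{
    unsigned int visible = 0u; // Assume occluded until the miss program says otherwise.

    optixTrace( handle, origin, direction,
                0.01f,         // tmin: small offset against self-intersection
                tmax - 0.01f,  // tmax: stop just before reaching the source
                0.0f,          // rayTime
                OptixVisibilityMask( 255 ),
                // No closesthit/anyhit needed for opaque scenes; terminate on the first hit.
                OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT
                | OPTIX_RAY_FLAG_DISABLE_ANYHIT
                | OPTIX_RAY_FLAG_DISABLE_CLOSESTHIT,
                RAY_TYPE_OCCLUSION, // SBT offset selecting the occlusion programs
                RAY_TYPE_COUNT,     // SBT stride
                RAY_TYPE_OCCLUSION, // missSBTIndex
                visible );          // a single payload register

    return visible != 0u;
}
```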

It's probably not a big difference, since either reaching the closesthit or the miss program ends the current ray and calls back into the streaming multiprocessors anyway, and both only set a single payload register.
There is no need for that inlined setPayloadOcclusion() function either, but that should collapse to the same assembly; it's just unnecessary code in my opinion.

When using the optixPathTracer as the basis for your code, are you still using the ray generation program with the for-loop over the number of samples?
I would change that to sample only one path per launch index then.
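
Paraphrased from memory (not verbatim SDK code), the relevant part of the ray generation program looks roughly like this:

```
// The optixPathTracer ray generation loops over the sample count per launch:
int i = params.samples_per_launch;
do
{
    // ... generate camera ray, trace the path, accumulate radiance ...
}
while( --i );

// Suggested change: drop the loop and trace exactly one path per launch index,
// accumulating across launches instead (e.g. via the subframe blending the
// example already does, or a separate CUDA kernel as described further below).
```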

Are you actually tracing a full path of radiance rays through the scene for the lighting and then, on each closest-hit event along any path, additionally sampling all 200+ sound sources?
Even if the occlusion/visibility rays described inside the link above are the fastest rays you can have, because they stay completely on the hardware RT cores until they terminate, this still sounds like an absurd number of total rays per launch.

Have you counted how many rays you’re actually shooting per launch in your scene?
Is the sound simulation running at the same resolution as the radiance simulation?

The simplest approach to speed this up would obviously be using faster hardware (depending on what you’re currently using) and distributing the workload to multiple GPUs.

If you need final “images” of your sound simulation and you’re using a progressive Monte Carlo algorithm to solve the sound propagation similar to the lighting, there are basically two ways: launching once and simulating every source, versus launching more often and picking only a subset of the sources (at least one) per launch.
In the end, the number of rays required to accumulate a sufficiently low-variance result won’t change with those approaches.

When you say picking only nearer sound sources for the sampling didn’t help either, what does that mean?
That you didn’t get a low enough variance to be used as your training input, or that it required more iterations to reach a sufficient result?

For lighting calculations there are some ways to reduce this variance, for example by using multiple importance sampling. I'm not sure how the material (BRDF) behavior would change for sound wavelengths though.
If you’re only using perfectly diffuse reflections, which is all the optixPathTracer example handles, that might just work, but it is theoretically incorrect for sound, which behaves very differently when hitting geometry smaller than its wavelength.

There are also algorithms which help with many-light sampling, like ReSTIR, but I don’t know if that can easily be applied to your sound sources.

There are also different light transport algorithms, like bidirectional path tracing or photon mapping, and the combination of these in the Vertex Connection and Merging (VCM) algorithm, which can solve many problematic light transport events.


Hey, thanks again for all the hints on how to solve this. To give some more information:

Are you actually tracing a full path of radiance rays through the scene for the lighting and then, on each closest-hit event along any path, additionally sampling all 200+ sound sources?

Yes, this is what I did as a first approach. I know it's not a good solution, so I tried different things. The count of my generated rays is basically 200 x 200 (pixels) x 16 (samples per launch) x approx. 3 (path depth). In the closest-hit shader, another ray (one sun occlusion trace for each hit) plus 200 sound-source occlusion traces get added per ray(!) from ray generation. To reduce the 200 occlusion traces I tried:

  1. limiting the traces by the distance between the sound source and the hit location; result: the quality was fine and there was a speed increase, but not a significant one, and if the limit is too low the results are no longer “realistic”
  2. using only a random selection of the 200 sound sources per hit (see the sketch after this list); result: not yet clear, I have to test further
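
For reference, this is roughly what I mean by approach 2, as a sketch reusing the isVisible() helper from the pattern above. rnd() is the per-thread RNG from the SDK's random.h, the float3 math comes from sutil's vec_math.h, and the function name, parameters and the inverse-square falloff are placeholders:

```
__forceinline__ __device__ float sampleSoundOcclusion(
    OptixTraversableHandle handle,
    const float3*          sources,  // positions of all N sound sources
    int                    N,        // e.g. 200
    int                    K,        // traces per hit instead of N, e.g. 8
    float3                 hitPoint,
    unsigned int&          seed )
{
    float sum = 0.0f;
    for( int i = 0; i < K; ++i )
    {
        // Pick one source uniformly at random.
        const int    s        = min( static_cast<int>( rnd( seed ) * N ), N - 1 );
        const float3 toSource = sources[s] - hitPoint;
        const float  dist     = length( toSource );

        if( isVisible( handle, hitPoint, toSource / dist, dist ) )
            sum += 1.0f / fmaxf( dist * dist, 1e-4f ); // illustrative inverse-square falloff
    }
    // Scale by N/K so the estimate has the same expected value as summing all N sources.
    return sum * static_cast<float>( N ) / static_cast<float>( K );
}
```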

Is the sound simulation running at the same resolution as the radiance simulation?

Yes it is, but because of the many occlusion rays I get much higher visual smoothness than necessary.

If you’re only using perfectly diffuse reflections, which is all the optixPathTracer example handles, that might just work, but it is theoretically incorrect for sound, which behaves very differently when hitting geometry smaller than its wavelength.

Yes, for now it's only an approximation of the real physics, which I plan to correct at a later stage.

The simplest approach to speed this up would obviously be using faster hardware (depending on what you’re currently using) and distributing the workload to multiple GPUs.

Right now I'm on a single RTX 2080 Ti. We are looking into using a setup with multiple A5000s (thanks to your explanation in another post) for this and other use cases. I'm not sure though if this would be a real speedup, since the output is only 200 by 200 pixels, so I'm not sure if the calculations even “fill” the cores of a single GPU.

Thanks for the algorithm recommendations, I will look into them!

Would it be possible to apply denoising to grayscale images this small (sample attached) to get away with fewer iterations? I'm basically rendering a “floorplan style” time series with estimates for GI and sound propagation over a day. Is it possible to fine-tune the denoiser for different kinds of images by providing input and ground truth?
[Attached image: trace_example]

Again thanks for the help!

So in the worst case you shoot 200 x 200 x 16 x 3 x (1 + 200) = 385,920,000 rays per launch. That’s quite some work.

I'm not sure though if this would be a real speedup, since the output is only 200 by 200 pixels, so I'm not sure if the calculations even “fill” the cores of a single GPU.

Depending on the required resources, especially the number of registers used inside a kernel, a launch size of 200 x 200 is slightly above the minimal size which can saturate a high-end GPU when using 128 registers per thread, for example.
It’s GPU specific and a little complicated: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-multithreading
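
To make that concrete with rough numbers (assuming the usual 64 K 32-bit registers per SM): at 128 registers per thread, one SM can hold 65,536 / 128 = 512 resident threads, and an RTX 2080 Ti with its 68 SMs therefore needs about 68 x 512 = 34,816 threads to be filled once, so a 200 x 200 = 40,000 thread launch is only slightly above that.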

I would recommend using Nsight Compute to check how the program behaves now.

If you’re always shooting 16 paths per launch index and your launch dimension is not saturating the GPU enough, you could also make the launch size 16 times bigger and shoot only one path per launch index without accumulation, and accumulate the results in a native CUDA kernel afterwards using some parallel reduction.
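
A minimal sketch of such an accumulation kernel, assuming the enlarged launch writes its 16 single-sample results per pixel into 16 consecutive slices of one output buffer (the buffer layout and names are my assumption, not from the SDK):

```
__global__ void accumulateSamples( const float4* slices, float4* out,
                                   int numPixels, int numSlices )
{
    const int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if( pixel >= numPixels )
        return;

    // Sum the single-sample results of this pixel across all slices.
    float4 sum = make_float4( 0.0f, 0.0f, 0.0f, 0.0f );
    for( int s = 0; s < numSlices; ++s )
    {
        const float4 v = slices[s * numPixels + pixel];
        sum.x += v.x; sum.y += v.y; sum.z += v.z; sum.w += v.w;
    }

    // Average to keep the same scale as the accumulated single-launch result.
    const float inv = 1.0f / static_cast<float>( numSlices );
    out[pixel] = make_float4( sum.x * inv, sum.y * inv, sum.z * inv, sum.w * inv );
}
```

A plain per-pixel loop is sufficient here instead of a full tree reduction, since there are only 16 values per pixel.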

Splitting the work over multiple GPUs could work the same way, e.g. with two GPUs each of them only shoots 8 paths per launch index, and then you would need to copy the result from one device to the other and accumulate there.
The data size is not that big to transfer over the PCI-E bus, though any of those additional operations will add runtime cost, and the more transfers are required, the higher the overhead.

Is it possible to fine-tune the denoiser for different kinds of images by providing input and ground truth?

The OptiX denoiser is meant to reduce the high-frequency noise of progressive Monte Carlo algorithms. I don’t see why that shouldn’t work on the greyscale image above. Mind that the denoiser only accepts 3- or 4-component half and float formats as input.
There are multiple denoiser algorithms implemented, which changed over time and keep changing, and even though there exists an API to specify trained network data, there is no way provided to actually train your own network.
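
Since your image is single-channel, you would need a tiny conversion kernel before handing the buffer to the denoiser. A sketch, with all names illustrative:

```
__global__ void grayToFloat4( const float* gray, float4* rgba, int numPixels )
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= numPixels )
        return;

    // Replicate the luminance into RGB and set an opaque alpha.
    const float g = gray[i];
    rgba[i] = make_float4( g, g, g, 1.0f );
}
```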


Thanks for the explanation on spreading the compute operations! That's really good advice. Also, I hadn't considered using Nsight Compute yet.

…and accumulate the results in a native CUDA kernel afterwards

I don’t really have experience with the underlying CUDA aspects. At which point would I do this?
Thanks a lot again for the in-depth answers and help!

Depending on whether you use the CUDA runtime or driver API, native CUDA kernels are launched with different calls.
You’re using CUDA host API calls inside your OptiX 7 application anyway, so the required environment to also run native CUDA kernels is already present; it’s just a matter of getting them to work inside your project’s build environment.

Inside the OptiX 7 SDKs, the optixRaycasting example demonstrates how to use native CUDA kernels with the CUDA runtime API.

For such an accumulation you could simply launch the respective CUDA kernel on the same CUDA stream you used to render your results, once after your progressive OptiX launches have been done. Since launches are asynchronous, using the same stream means the kernel will run immediately after the last OptiX launch has finished.
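
A host-side sketch of that ordering, using the runtime API and the accumulateSamples kernel from above (the pipeline, stream and buffer variables are illustrative; CUDA_CHECK is the SDK's error-check macro):

```
// Enqueue all progressive OptiX launches asynchronously on one stream.
for( int i = 0; i < numSubframes; ++i )
{
    // (Update per-subframe launch parameters in d_params as needed.)
    optixLaunch( pipeline, stream, d_params, sizeof( Params ), &sbt, width, height, /*depth=*/1 );
}

// Enqueued on the same stream, so it starts only after the last optixLaunch finished.
const int threadsPerBlock = 256;
const int numBlocks       = ( numPixels + threadsPerBlock - 1 ) / threadsPerBlock;
accumulateSamples<<<numBlocks, threadsPerBlock, 0, stream>>>( d_slices, d_result, numPixels, 16 );

CUDA_CHECK( cudaStreamSynchronize( stream ) ); // Wait before reading the result on the host.
```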

For more information on CUDA programming please refer to the CUDA Programming Guide, especially the Programming Model and the Runtime chapters when using the runtime API.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model
Then have a look at the CUDA examples inside the toolkit.
