Fill output buffer from multiple threads

Hi all,

I’m writing a path tracer which primarily focuses on exporting a final image rather than updating a viewer, which means I do a single launch rather than iterative launches as is common in the OptiX samples. At the moment I’m looping through the sample count per pixel in the raygen program and then writing the average of all samples to the output buffer. This feels rather inefficient utilization-wise, given that some pixels will have all samples invoking only the miss program while others will have multiple bounces and light samples to trace for every pixel sample.

Therefore I’m thinking that I can use the z dimension of the launch for the sample count rather than looping in the raygen program. But I’m unsure how to then efficiently get the correct color into my output buffer.

I know that when iterating through samples you can just lerp the color that’s already in the buffer with what is coming from that sample, like @droettger does in the OptiX Apps:

if (0 < sysParameter.iterationIndex)
{
  const float4 dst = sysParameter.outputBuffer[index]; // RGBA32F
  radiance = lerp(make_float3(dst), radiance, 1.0f / float(sysParameter.iterationIndex + 1));
}

If I’m firing all samples at once I would know the sample index, but not how many samples have already been written to the buffer. So my question is really whether there is a standard way to solve this problem.

The options I’m considering so far are either to have a separate buffer which holds a counter of how many samples have been written to the output and do an atomic operation when lerping/incrementing the counter, or to add all samples with atomics and then do a second pass after rendering where I divide the output buffer by the sample count.

Thanks,
Oscar

One approach I have used is to work in tiles and shoot all samples per pixel at once.
To get good GPU load per tile I used around a million rays per tile, for example at 256 spp that gives 64x64 tiles. Depends on the GPU how much is necessary to fully saturate the GPU.
To do that I used a 2D launch where the width was the samples per pixel and the height was the number of pixels in the tile (meaning each row corresponds to the linear index of one pixel in the tile). With the above example that is a 2D launch of 256 * 4096 cells.
That writes all samples to different memory locations, and the resulting buffer can then be very quickly accumulated over the samples per pixel (rows) with a native CUDA kernel (a simple reduction kernel over the width), with the result written into the final tile location in the full-screen buffer.
No atomics needed, good locality of the samples, completely asynchronous launches in OptiX 7.
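
For illustration, a minimal sketch of what such a per-tile reduction kernel could look like, assuming the tile buffer is laid out as described above (one row of spp float4 samples per tile pixel). All names (reduceTileSamples, tileBuffer, outputBuffer, the tile origin parameters) are placeholders for the example, not code from the OptiX Apps:

// Hypothetical reduction: one block per tile pixel (row); the threads of the
// block cooperatively sum that pixel's samples, then the average is written to
// the pixel's position in the full-resolution output buffer.
// Requires a power-of-two blockDim.x and blockDim.x * sizeof(float4) bytes of
// dynamic shared memory at launch time.
__global__ void reduceTileSamples(const float4* tileBuffer,   // spp * tilePixels samples
                                  float4*       outputBuffer, // full-resolution RGBA32F
                                  unsigned int  spp,
                                  unsigned int  tileWidth,
                                  unsigned int  tileOriginX,
                                  unsigned int  tileOriginY,
                                  unsigned int  screenWidth)
{
  extern __shared__ float4 sdata[];

  const unsigned int pixel = blockIdx.x;  // linear pixel index inside the tile
  const unsigned int tid   = threadIdx.x;

  // Each thread accumulates a strided subset of the row (coalesced reads).
  float4 sum = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
  for (unsigned int s = tid; s < spp; s += blockDim.x)
  {
    const float4 v = tileBuffer[pixel * spp + s];
    sum.x += v.x; sum.y += v.y; sum.z += v.z; sum.w += v.w;
  }
  sdata[tid] = sum;
  __syncthreads();

  // Standard shared-memory tree reduction over the block.
  for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)
  {
    if (tid < stride)
    {
      sdata[tid].x += sdata[tid + stride].x;
      sdata[tid].y += sdata[tid + stride].y;
      sdata[tid].z += sdata[tid + stride].z;
      sdata[tid].w += sdata[tid + stride].w;
    }
    __syncthreads();
  }

  if (tid == 0)
  {
    const float invSpp = 1.0f / float(spp);
    const unsigned int x = tileOriginX + pixel % tileWidth;
    const unsigned int y = tileOriginY + pixel / tileWidth;
    outputBuffer[y * screenWidth + x] = make_float4(sdata[0].x * invSpp,
                                                    sdata[0].y * invSpp,
                                                    sdata[0].z * invSpp,
                                                    sdata[0].w * invSpp);
  }
}

With the 256 spp / 64x64 tile example above this would be launched with 4096 blocks of 256 threads and 256 * sizeof(float4) bytes of dynamic shared memory.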

Read the warning about how much work you can do in a single launch at the end of this recent thread:
https://forums.developer.nvidia.com/t/task-scheduling-in-optix-7/167050/5

Thanks, makes sense! I will have a go.

Is the use of Atomics and multiple threads writing to the same memory generally a bad idea compared to unique memory location and reduction?

Is the use of Atomics and multiple threads writing to the same memory generally a bad idea compared to unique memory location and reduction?

It depends on how often the atomic blocks other threads. It can’t be faster than a standard write. Sometimes you cannot avoid it.
Also since there are no vectorized atomics, you would need to use one atomic for each color component.
Multi-GPU is another topic which requires attention.

Check this post: https://forums.developer.nvidia.com/t/best-strategy-for-splatting-image-for-bidir/111000/2
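
For reference, a hedged sketch of what “one atomic per color component” means when splatting into a float4 accumulation buffer (the function and buffer names are placeholders):

// Hypothetical device-side splat: there is no float4 atomicAdd, so each color
// component gets its own atomicAdd into the accumulation buffer.
__device__ void splatAtomic(float4* accumBuffer, const unsigned int pixelIndex, const float3 radiance)
{
  float* p = reinterpret_cast<float*>(&accumBuffer[pixelIndex]);
  atomicAdd(p + 0, radiance.x);
  atomicAdd(p + 1, radiance.y);
  atomicAdd(p + 2, radiance.z);
  // The alpha component (p + 3) is left untouched in this sketch.
}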


Depends on the GPU how much is necessary to fully saturate the GPU.

Is there any documentation on this or a way to calculate what’s required for a good saturation?

To get good GPU load per tile I used around a million rays per tile, for example at 256 spp that gives 64x64 tiles.

Would that be enough for an RTX A6000 and all previous workstation GPUs?

For the GPU you have installed you can query the GPU load while running your renderer by using query functions of the NVIDIA-SMI tool which gets installed along with the standard display drivers. Not sure if that is also inside the DCH drivers.

E.g. this would print the name and GPU load for every installed device every second:
"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" --format=csv,noheader --query-gpu=name,utilization.gpu --loop-ms=1000

For a general answer, it’s tricky. This really depends on what you programmed.

It’s rather complicated, as can be seen by the number of parameters available inside the CUDA Occupancy Calculator Excel sheet, and you do not have control over the scheduling in OptiX, so this is not really helpful. There is no single magic number.
But as a rule of thumb, anything below 64k threads will normally be too small to saturate a high-end GPU today, so a million threads per launch isn’t a bad start.

It’s not like you would need to hard-code that but could make that configurable and let the end user determine what works best.
It’s a balance between performance and crashing if you hit the 2 second Timeout Detection and Recovery (TDR) under Windows.

I would really recommend implementing it with full resolution launches and one sample per launch first to be able to determine if any other workload distribution is actually worth it.

You can make the update of the sub-frame (iteration) index asynchronous as well and submit a lot of launches to the CUDA stream and then only synchronize once at the end.
It’s similar to the benchmark mode in my example apps:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/rtigo3/src/Application.cpp#L475
and this comment:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/rtigo3/src/DeviceSingleGPU.cpp#L153
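
Roughly, that pattern could look like the sketch below; the names (params, d_params, pipeline, sbt, numSubFrames, launch dimensions) are placeholders and error checking is omitted, so treat it as an outline rather than the actual code of the linked apps:

// Hypothetical host loop: enqueue all sub-frame launches on one CUDA stream
// and synchronize only once at the end.
for (unsigned int i = 0; i < numSubFrames; ++i)
{
  params.iterationIndex = i; // host-side copy of the launch parameters

  // Stream-ordered copy of the updated parameters; it runs before the launch
  // enqueued after it on the same stream.
  cudaMemcpyAsync(reinterpret_cast<void*>(d_params), &params, sizeof(LaunchParams),
                  cudaMemcpyHostToDevice, stream);

  optixLaunch(pipeline, stream, d_params, sizeof(LaunchParams), &sbt,
              launchWidth, launchHeight, 1);
}
cudaStreamSynchronize(stream); // single synchronization after all sub-frames

Whether the host-side params struct can safely be reused across the asynchronous copies depends on how it is allocated (pageable vs. pinned host memory), which is a detail worth checking against the linked code.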

Handy script! Found nvidia-smi.exe in C:\Windows\System32.

I’ve never hit the 2 second timeout on any of the different hardware configurations (all Windows 10) I’ve run on so far. The app I initiate OptiX through might be overriding it, not sure.

Just for testing, I can launch my previous configuration with 400 spp (where I loop the samples in the raygen program) over 1920x1080 on a Laptop X1 with a GTX 1050 Ti (which has an additional Intel UHD Graphics 630 GPU). It takes the GPU about a minute, but no problems. On a workstation with a single RTX A6000, same scene, same resolution but 10,000 spp, it takes 30 seconds, still no complaints.

If I’m not bound by the 2 sec TDR (need to research this a bit), then the best thing performance-wise should be to launch as I mentioned initially, using the z launch dimension as the sample index, write to unique memory locations and then run a reduction kernel? (While keeping track of the maximum launch size.) This would obviously not give any sample locality, but it would maximise the launch size and minimise the number of times I have to upload my launch parameters and call optixLaunch.

I’ll try moving to full resolution launches and one sample per launch first to test.

Thanks again for your answers @droettger!

Handy script!

Check its manual. It can do a lot more.

I’ve never hit the 2 second timeout on any of the different hardware configurations (all Windows 10) I’ve run on so far. The app I initiate OptiX through might be overriding it, not sure.

You cannot override the TdrDelay without a reboot.
Microsoft states changing it is only allowed for debug purposes and not in shipping applications.

It depends on the driver and GPU configuration whether this can be prevented or not, but I wouldn’t risk it.

If I’m not bound by the 2 sec TDR (need to research this a bit), then the best thing performance-wise should be to launch as I mentioned initially, using the z launch dimension as the sample index, write to unique memory locations and then run a reduction kernel?
This would obviously not give any sample locality, but it would maximise the launch size and minimise the number of times I have to upload my launch parameters and call optixLaunch.

I would not recommend exactly that.
The launch calls themselves are asynchronous and take only microseconds. There is not really a win in launching a huge number of threads vs. a reasonable number of launches with fewer threads. Doing less work more often is the more robust approach.

You wouldn’t have the memory to store all data in a 3D buffer when not accumulating.
Also note that the ray tracing launch size is basically limited to a total of 2^30 (the product of the launch dimensions).
We had that discussion before: https://forums.developer.nvidia.com/t/3d-optixlaunch-to-accommodate-multiple-viewpoints/160421

Right, this is also not recommended because it will result in bad memory access patterns when looping over the z-dimension per thread.
This is actually a case where accumulation with atomicAdd into a single 2D slice would be reasonable. (Needs a final division by sample count.)