RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL question

droettger · July 1, 2020, 8:29am

A standard buffer with RT_BUFFER_INPUT_OUTPUT can be read and written on the host and the device.
Adding the RT_BUFFER_GPU_LOCAL flag means each device can hold its own full-size copy of the buffer, so reading from it on the host between map/unmap will not give you a consistent result. Most likely it reads just one of the per-device copies, if anything at all.

Citing the OptiX 6.5.0 programming guide:
RT_BUFFER_GPU_LOCAL
Can only be used in combination with RT_BUFFER_INPUT_OUTPUT. This restricts the host to write operations as the buffer is not copied back from the device to the host. The device is allowed read-write access. However, writes from multiple devices are not coherent, as a separate copy of the buffer resides on each device.

That would explain why a map/unmap with no write from the host breaks the buffer contents. I would expect that to upload undefined data on the unmap.

The whole idea of RT_BUFFER_GPU_LOCAL was to make it possible to keep buffers in VRAM in a multi-GPU configuration. There, standard RT_BUFFER_OUTPUT and RT_BUFFER_INPUT_OUTPUT buffers are allocated in pinned memory on the host, which all devices can access simultaneously. That doesn’t scale linearly with an increasing number of GPUs due to PCI-E bus congestion. That’s why accumulations in a progressive renderer should be done in GPU-local buffers, with only the final results written to an output buffer. That eliminates the read-modify-write operations via PCI-E during the accumulation and results in better multi-GPU performance.
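
To make the read-modify-write pattern concrete, here is a minimal CPU-side sketch of a progressive running average in plain C++. It is not OptiX device code: `Float4` and `accumulate` are hypothetical names used only for illustration, and the running-average formula is one common way to accumulate, not necessarily what your renderer does. In a real renderer the `accum` data would live in the GPU-local buffer, so this update never touches the PCI-E bus.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's float4; not an OptiX type.
struct Float4 { float x, y, z, w; };

// Blend the samples of the given 0-based iteration into the running average.
// Every pixel is read, modified, and written back, which is the access
// pattern that is expensive when the buffer sits in pinned host memory.
void accumulate(std::vector<Float4>& accum,
                const std::vector<Float4>& sample,
                unsigned iteration)
{
    assert(accum.size() == sample.size());

    // Weight of the new sample: 1 on the first iteration, 1/2 on the second, ...
    const float t = 1.0f / static_cast<float>(iteration + 1);

    for (std::size_t i = 0; i < accum.size(); ++i) {
        Float4& a = accum[i];
        const Float4& s = sample[i];
        a.x += (s.x - a.x) * t;
        a.y += (s.y - a.y) * t;
        a.z += (s.z - a.z) * t;
        a.w += (s.w - a.w) * t;
    }
}
```

Calling `accumulate` once per launch with an increasing `iteration` keeps the mean of all samples so far in `accum`. Only a final copy or conversion into the output buffer then has to cross the bus.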

The direct access via getting a CUDA device pointer per device might actually be a workaround, but you would really need to test that on a multi-GPU system.

Are there any caveats in reading from such a buffer using the above-mentioned approach for a single-GPU configuration?

With a single device there is no need to use RT_BUFFER_GPU_LOCAL. Standard RT_BUFFER_INPUT_OUTPUT buffers will reside on the device.

I would use RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL buffers only as device scratch space and never access them from the host at all. That way you have consistent behavior in single-GPU and multi-GPU configurations.
That’s exactly why the accum_buffer is implemented like this inside the OptiX SDK examples. Note that the output_buffer can also be a tonemapped uchar4 buffer instead of the full float4 accumulation buffer, which reduces the PCI-E bandwidth requirements for the final output even more.
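
As a rough illustration of the bandwidth argument, the following plain C++ sketch converts a float4 color to a uchar4 color and computes the size of an output buffer for both formats. All names (`Float4`, `UChar4`, `quantize`, `toDisplay`, `outputBytes`) are hypothetical, and the conversion only clamps and quantizes. A real tonemapper would typically also apply exposure and gamma, which is left out here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for CUDA's float4 and uchar4 vector types.
struct Float4 { float x, y, z, w; };
struct UChar4 { std::uint8_t x, y, z, w; };

// Clamp to [0, 1] and quantize to 8 bits with rounding. NaN maps to 0.
inline std::uint8_t quantize(float v)
{
    const float c = (v > 0.0f) ? ((v < 1.0f) ? v : 1.0f) : 0.0f;
    return static_cast<std::uint8_t>(c * 255.0f + 0.5f);
}

// Placeholder for a tonemapper: clamp and quantize only.
inline UChar4 toDisplay(const Float4& c)
{
    return UChar4{ quantize(c.x), quantize(c.y), quantize(c.z), quantize(c.w) };
}

// Bytes that have to be transferred for one full frame.
constexpr std::size_t outputBytes(std::size_t width,
                                  std::size_t height,
                                  std::size_t bytesPerPixel)
{
    return width * height * bytesPerPixel;
}
```

For a 1920x1080 frame, `outputBytes(1920, 1080, sizeof(Float4))` is 33,177,600 bytes, while `outputBytes(1920, 1080, sizeof(UChar4))` is 8,294,400 bytes, a quarter of the data per frame.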

That said, none of these issues exist in OptiX 7 which doesn’t know about buffers, textures, or multiple GPUs.
All that is fully under your control with native CUDA host code inside your application.
CUDA interoperability is not a special case: you work with native CUdeviceptr instead of abstractions. That also means you can work with native CUDA kernels on the same data.
OptiX 7 API functions are explicit, acceleration structure builds and launches are asynchronous, the API is thread-safe, and performance will generally be better overall.
If you’re working with multiple GPUs, you can implement any work distribution you need, and scaling depends only on your algorithms. You can use NVLink peer-to-peer transfers and share data explicitly. Nothing happens under the hood like it did in OptiX 6 and before.
https://forums.developer.nvidia.com/t/porting-to-optix-7/79249/3
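
Since the work distribution is yours to implement in OptiX 7, here is a minimal sketch of one simple scheme: splitting the image into contiguous row bands, one per device. This is plain C++ with hypothetical names (`RowRange`, `splitRows`). It is not part of the OptiX API and only computes the ranges; the per-device launches themselves are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range of image rows [begin, end) assigned to one device.
struct RowRange { std::size_t begin; std::size_t end; };

// Split `height` rows into `deviceCount` contiguous bands whose sizes
// differ by at most one row. Leftover rows go to the first devices.
std::vector<RowRange> splitRows(std::size_t height, std::size_t deviceCount)
{
    assert(deviceCount > 0);

    std::vector<RowRange> ranges;
    ranges.reserve(deviceCount);

    const std::size_t base  = height / deviceCount;
    const std::size_t extra = height % deviceCount;

    std::size_t begin = 0;
    for (std::size_t d = 0; d < deviceCount; ++d) {
        const std::size_t rows = base + (d < extra ? 1 : 0);
        ranges.push_back(RowRange{begin, begin + rows});
        begin += rows;
    }
    return ranges;
}
```

Each range would then be rendered by a separate launch on its own device. Contiguous bands are easy to reason about but can balance poorly when scene complexity varies across the image, so interleaved tiles are a common alternative.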

You’ll get the idea. I can really recommend using OptiX 7 instead.
The sticky posts in this forum contain links to more OptiX 7.0.0 example code and presentations.

PS: OptiX 7.1.0 was released yesterday!
The OptiX 7 examples on GitHub will need some very small changes to compile with OptiX 7.1.0 because some structures changed. I’ll adjust them shortly.
