I’ve been observing some behaviour concerning RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL which I struggle to explain, given what I found in related posts:

To quote the explanation in the posts and the documentation, the above flags result in the following behaviour wrt buffers:

the host to only write, and the device to read and write data

I had assumed this meant that it should be possible to write into the buffer between the map/unmap calls on the host side. The issue we’re having at the moment is what I can only vaguely (sorry!) describe as incorrect rendering. It happens as soon as we add the map/unmap calls, with nothing in between: just these two calls which, as far as I understand, shouldn’t be doing anything.

To give a better example, I tried to do the same for the optixWhitted rendering sample (optixWhitted.cpp) that comes with the OptiX 6.5 SDK. Namely, the changes I introduced are as follows:

  • pull accum_buffer into global scope at line 85, i.e. simply define Buffer accum_buffer;
  • line 164: remove Buffer in front of accum_buffer
  • the buffer itself is defined (didn’t change anything here) as: accum_buffer = context->createBuffer( RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL, RT_FORMAT_FLOAT4, width, height );
  • inside glutDisplay() function:
    • line 421: add accum_buffer->map();
    • line 422: add accum_buffer->unmap();

The screen turns dark (a little bit more extreme than our incorrect rendering), but renders as usual when I move the camera around, and then turns dark again when I stop.

Could you please let me know what I am missing? In particular, I was wondering where my understanding is lacking with regard to the fact that the host should be able to write into such buffers (I assume in between the map/unmap calls).

And another related question. I realized that we have a couple of buffers defined with RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL flags, which we are both writing to (from host) and reading from (into host) using a cuda pointer obtained with buffer->getDevicePointer(optix_device_ordinal). Doesn’t appear that there are any issues there. Are there any caveats in reading from such a buffer using the above-mentioned approach for a single-GPU configuration?

I would greatly appreciate any pointers in the right direction. Thanks a lot for your time.

Setup: Win 10, Nvidia Quadro P4000, driver 451.48, OptiX 6.5/6.0 (tested both)

A standard buffer with RT_BUFFER_INPUT_OUTPUT can be read and written on the host and the device.
Adding the RT_BUFFER_GPU_LOCAL flag means the buffer can exist on multiple devices in its full size and reading from it between map/unmap will not give you a consistent result. It’s most likely reading just one of the buffers, if at all.

Citing the OptiX 6.5.0 programming guide:
Can only be used in combination with RT_BUFFER_INPUT_OUTPUT. This restricts the host to write operations as the buffer is not copied back from the device to the host. The device is allowed read-write access. However, writes from multiple devices are not coherent, as a separate copy of the buffer resides on each device.

That would explain why a map/unmap with no write from the host breaks the buffer contents. I would expect that to upload undefined data on the unmap.
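To illustrate that expectation, here is a rough mental model in plain C++. It is a hypothetical sketch, not the actual OptiX implementation: the class GpuLocalBufferModel and its methods are made up for illustration, and the staging memory is zero-initialized only so the example is deterministic (in reality its contents would be undefined).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model, NOT the OptiX implementation: a GPU-local buffer
// whose host staging memory is never refreshed from the device copy.
class GpuLocalBufferModel {
public:
    explicit GpuLocalBufferModel(std::size_t count)
        : device_(count, 0.0f), staging_(count, 0.0f) {}

    // Stands in for a device program writing into the buffer.
    void deviceWrite(std::size_t i, float v) {
        assert(i < device_.size());
        device_[i] = v;
    }

    // Stands in for a device program reading from the buffer.
    float deviceRead(std::size_t i) const {
        assert(i < device_.size());
        return device_[i];
    }

    // map(): hands out the staging memory WITHOUT a device-to-host copy.
    float* map() {
        mapped_ = true;
        return staging_.data();
    }

    // unmap(): uploads the staging memory, overwriting the device copy.
    void unmap() {
        assert(mapped_);
        device_ = staging_;
        mapped_ = false;
    }

private:
    std::vector<float> device_;   // contents as seen by the device
    std::vector<float> staging_;  // host memory handed out by map()
    bool mapped_ = false;
};
```

In this model, a value accumulated on the device is lost after an empty map()/unmap() pair, because the untouched staging memory replaces it. A host write between the two calls still arrives on the device as intended.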

The whole idea of RT_BUFFER_GPU_LOCAL was to make it possible to keep buffers in VRAM in a multi-GPU configuration. There, standard RT_BUFFER_OUTPUT and RT_BUFFER_INPUT_OUTPUT buffers are allocated in pinned memory on the host, which all devices can access simultaneously. Access to that pinned memory doesn’t scale linearly with an increasing number of GPUs due to PCI-E bus congestion. That’s why accumulations in a progressive renderer should be done in GPU-local buffers, with only the final results written to an output buffer. That eliminates the read-modify-write operations via PCI-E during the accumulation and results in better multi-GPU performance.
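The read-modify-write in question looks roughly like the following running average. This is a minimal host-side C++ sketch under my own assumptions (the Float4 struct and the accumulate function are illustrative names, and the real code would run per pixel in a device program), but it shows why every sample has to read and write the accumulation buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for a float4 pixel.
struct Float4 { float x, y, z, w; };

// Blend one new sample into the running average stored in accum[i].
// 'frame' is the zero-based index of the current sample, so the first
// sample (frame 0) simply replaces whatever was stored before.
inline void accumulate(std::vector<Float4>& accum, std::size_t i,
                       const Float4& sample, unsigned frame) {
    assert(i < accum.size());
    const float a = 1.0f / static_cast<float>(frame + 1);
    Float4& p = accum[i];           // read the previous average ...
    p.x += (sample.x - p.x) * a;    // ... modify it ...
    p.y += (sample.y - p.y) * a;
    p.z += (sample.z - p.z) * a;
    p.w += (sample.w - p.w) * a;    // ... and write it back in place
}
```

If accum lives in pinned host memory, each of these reads and writes crosses PCI-E for every pixel and every frame. With a GPU-local buffer they stay in VRAM.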

The direct access via getting a CUDA device pointer per device might actually be a workaround, but you would really need to test that on a multi-GPU system.

Are there any caveats in reading from such a buffer using the above-mentioned approach for a single-GPU configuration?

With a single device there is no need to use RT_BUFFER_GPU_LOCAL. Standard RT_BUFFER_INPUT_OUTPUT buffers will reside on the device.

I would use RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL buffers only as device scratch space and never access them via the host at all. That way you have consistent behavior in single and multi-GPU configurations.
That’s exactly why the accum_buffer is implemented like this inside the OptiX SDK examples. Note that the output_buffer can also be a tonemapped uchar4 buffer instead of the full float4 accumulation buffer which reduces the PCI-E bandwidth requirements for the final output even more.
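As a sketch of that last point, the conversion from an accumulated float4 value to a uchar4 output pixel could look like this. It is plain C++ under my own assumptions: Float4, UChar4, toByte and tonemap are illustrative names, and the clamp plus gamma 2.2 curve is just one simple choice of tonemapper, not what the SDK examples necessarily use.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative stand-ins for the float4 and uchar4 pixel types.
struct Float4 { float x, y, z, w; };
struct UChar4 { std::uint8_t x, y, z, w; };

// 16 bytes per accumulated pixel versus 4 bytes per displayed pixel.
static_assert(sizeof(Float4) == 16 && sizeof(UChar4) == 4,
              "output pixel is a quarter of the accumulation pixel");

// Clamp one channel to [0, 1], apply the gamma curve, round to 8 bits.
inline std::uint8_t toByte(float c, float invGamma) {
    const float clamped = std::min(std::max(c, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(std::pow(clamped, invGamma) * 255.0f + 0.5f);
}

// Convert an accumulated color to a display pixel; alpha is set to opaque.
inline UChar4 tonemap(const Float4& c, float gamma = 2.2f) {
    assert(gamma > 0.0f);
    const float invGamma = 1.0f / gamma;
    return { toByte(c.x, invGamma), toByte(c.y, invGamma),
             toByte(c.z, invGamma), 255 };
}
```

Only the 4-byte result has to be written to the output buffer, so the final transfer is a quarter of the size of the full float4 data.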

That said, none of these issues exist in OptiX 7 which doesn’t know about buffers, textures, or multiple GPUs.
All that is fully under your control with native CUDA host code inside your application.
CUDA interoperability is not a special case, you work with native CUdeviceptr instead of abstractions. That also means you can work with native CUDA kernels on the same data.
OptiX 7 API functions are explicit, acceleration structure builds and launches are asynchronous, the API is multi-threading safe, and performance will generally be better overall.
If you’re working with multiple GPUs, you can implement any work distribution you need, scaling is only dependent on your algorithms. You can use NVLINK peer-to-peer transfers and share data explicitly. Nothing is happening under the hood like in OptiX 6 and before.

You’ll get the idea. I can really recommend using OptiX 7 instead.
The sticky posts in this forum contain links to more OptiX 7.0.0 example code and presentations.

PS: OptiX 7.1.0 was released yesterday!
The OptiX 7 examples on GitHub will need some very small changes to compile with OptiX 7.1.0 because some structures changed. I’ll adjust them shortly.
