Cannot read from output buffer on A100 GPU - OptiX 6.5

Hello everyone,

I’m porting an application from OptiX 5 to OptiX 6.5. The program runs fine on an RTX 2080, but when I run it on an A100 GPU, memory mapping doesn’t work properly for an output buffer. To be specific, I create an OptiX buffer for output, write to it in the OptiX ray generation program, and transfer the data back to host memory:

optix::Context opx_context = optix::Context::create();

// Output buffer written by the ray generation program.
optix::Buffer out_buffer_optix = opx_context->createBuffer(RT_BUFFER_OUTPUT,
                                                           RT_FORMAT_FLOAT, n_rays);
opx_context["OUT_BUF"]->set(out_buffer_optix);

opx_context->launch(0, n_rays);

// Map the output buffer and copy its contents back to host memory.
float *out_buffer_host = new float[n_rays];
memcpy(out_buffer_host, out_buffer_optix->map(), n_rays * sizeof(float));
out_buffer_optix->unmap();

In this example, out_buffer_host always ends up containing all zeros, no matter what the buffer size is.
I’m sure that I actually write to the buffer in the ray generation program, because the output of rtPrintf(...) is correct:

rtBuffer<float, 1> OUT_BUF;
rtDeclareVariable(unsigned int, r_idx, rtLaunchIndex, );

RT_PROGRAM void MakeRays()
{
    OUT_BUF[r_idx] = r_idx;
    rtPrintf("out buf element: %.4f\n", OUT_BUF[r_idx]);
}

It’s worth noting that transferring data into an input buffer works fine; all OptiX programs can access the input buffer data.
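
For completeness, the working input path looks roughly like this (just a sketch; IN_BUF and in_data_host are illustrative names):

optix::Buffer in_buffer_optix = opx_context->createBuffer(RT_BUFFER_INPUT,
                                                          RT_FORMAT_FLOAT, n_rays);
// Fill the input buffer from host memory before the launch.
memcpy(in_buffer_optix->map(), in_data_host, n_rays * sizeof(float));
in_buffer_optix->unmap();
opx_context["IN_BUF"]->set(in_buffer_optix);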
A somewhat similar issue is present in the OptiX 6.5 SDK samples. For example, optixHello outputs the following frame:
demo_frame (576.0 KB)

The display driver version is 450.119.04. I tried building with both CUDA 10.1 and CUDA 11; the results are the same.

Can anybody help me with this issue?

Thanks in advance!

Kind regards,
Pavel

Would you be able to update the display drivers on the A100 configuration?

Since OptiX 6.5.0, the OptiX core implementation resides inside the display driver.
That means the first thing to try when inexplicable things happen is changing the display driver.

Your display driver version 450.119.04 is a Data Center/Tesla driver for Linux 64-bit with CUDA 11.0 support from May 4, 2021. While that is the newest release date, it’s not from the newest driver branch; a 460.73.01 driver with CUDA 11.2 support is available as well.

Please keep using the CUDA 10.1 toolkit for compiling the OptiX PTX input code, though, because that’s the CUDA toolkit version with which OptiX 6.5.0 was built (see the OptiX release notes). The CUDA 11.2 toolkit might not work reliably for that.
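
For reference, the SDK samples can also produce the PTX at runtime through NVRTC, in which case the same toolkit-version advice applies to the NVRTC library used. A minimal sketch of that path (the include path and compile options here are assumptions, and error/log handling is omitted):

#include <nvrtc.h>
#include <string>

// Compile OptiX device code (the .cu text in 'source') to PTX at runtime
// with NVRTC from the CUDA 10.1 toolkit.
std::string compileToPtx(const char* source)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source, "MakeRays.cu", 0, nullptr, nullptr);

    const char* options[] = { "-arch=compute_60",
                              "-I/path/to/NVIDIA-OptiX-SDK-6.5.0-linux64/include" };
    nvrtcCompileProgram(prog, 2, options); // check the result and log in real code

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    return ptx;
}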

Was the optixHello output PPM image from running the pre-compiled SDK examples or from building them yourself?
(I don’t know if the OptiX SDK 6.5.0 comes with pre-compiled examples under Linux, like it does under Windows.)
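
Independent of the driver question, one way to narrow this down would be to bypass map() and read the output buffer through its CUDA device pointer. Just a sketch, assuming a single-device context at OptiX device ordinal 0 and reusing the variable names from your code:

#include <cuda_runtime.h>
#include <vector>

// Read the OptiX buffer back via its raw CUDA device pointer instead of map().
// If this copy contains the expected values, the launch wrote the data and
// only the map() readback path is failing.
void* d_ptr = out_buffer_optix->getDevicePointer(0); // OptiX device ordinal 0
std::vector<float> check(n_rays);
cudaMemcpy(check.data(), d_ptr, n_rays * sizeof(float), cudaMemcpyDeviceToHost);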

I’ll file an internal bug report.

Hi @droettger,

Thanks for the suggestion. Is the 460.73.01 driver compatible with CUDA 11.0?

The PPM image was generated by the optixHello that I compiled using CMake. In the CMake configuration, I specified gcc-8 as the host compiler and the CUDA 10.1 installation as the CUDA toolkit root directory.

Is the 460.73.01 driver compatible with CUDA 11.0?

Yes, CUDA drivers are backwards compatible.
The driver release notes only mention the maximum CUDA version supported.
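
If in doubt, both versions can be queried at runtime; a minimal sketch using the CUDA runtime API:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion); // CUDA version of the runtime this binary uses
    printf("driver supports up to CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0; // the runtime version must not exceed what the driver supports
}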

Our quality assurance was not able to reproduce the corrupted optixHello image output on an A100, using either the 450.119.04 drivers or a newer internal branch, with both the pre-compiled executables and versions built with gcc-8 and CUDA 10.1.

Could you add additional information about the exact system setup and operating system configuration, so we can see if there is any mismatch?
Otherwise there is nothing that can be done about your issue.
This is the only report of such behavior, and nothing would work at all if this were a systematic error.

In any case, I would recommend porting to an OptiX 7 version, since that is the future-proof way and will result in a more flexible and faster implementation. It’s quite an undertaking, though, since the host code basically needs a rewrite, but I think it’s worth it.
I have demonstrated the necessary host and device code changes by porting some of my own OptiX 5.1 based introduction examples to OptiX 7 versions here: https://github.com/NVIDIA/OptiX_Apps
Links to the old examples in the README.md.
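
To give an impression of the change relevant to this thread: in OptiX 7 there are no rtBuffer objects anymore; an output buffer is a plain CUDA allocation that you pass to the launch yourself and read back with cudaMemcpy. A rough sketch, with error checks omitted and illustrative names:

#include <cuda_runtime.h>
#include <vector>

// OptiX 7 style output "buffer": an ordinary CUDA device allocation.
float* d_out = nullptr;
cudaMalloc(&d_out, n_rays * sizeof(float));
// ... store d_out in the launch parameter struct and call optixLaunch(...) ...
std::vector<float> h_out(n_rays);
cudaMemcpy(h_out.data(), d_out, n_rays * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_out);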

Hi @droettger,

I’m sorry for the late reply. The problem persists in the pre-compiled samples, as well as in the other samples (see, for example, the output of optixSphere: optix_sphere.ppm (2.3 MB)).

I use Ubuntu 20.04.2 LTS. Here is what uname -a outputs:

Linux astra-tesla-10 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

The output of nvidia-smi is attached as a screenshot.

Output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

I build the samples as follows:

cd NVIDIA-OptiX-SDK-6.5.0-linux64/SDK
mkdir build_cuda10_cmake
cd build_cuda10_cmake
cmake -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 ../
make

Here is the output of ldd ./bin/optixHello:

linux-vdso.so.1 (0x00007ffc4b3e2000)
libsutil_sdk.so => /home/pavel/NVIDIA-OptiX-SDK-6.5.0-linux64/SDK/build_cuda10_cmake/lib/libsutil_sdk.so (0x00007f13aacba000)
liboptix.so.6.5.0 => /home/pavel/NVIDIA-OptiX-SDK-6.5.0-linux64/lib64/liboptix.so.6.5.0 (0x00007f13aa9c9000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f13aa7d0000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f13aa7b5000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f13aa5c3000)
libglut.so.3 => /usr/lib/x86_64-linux-gnu/libglut.so.3 (0x00007f13aa378000)
libOpenGL.so.0 => /usr/lib/x86_64-linux-gnu/libOpenGL.so.0 (0x00007f13aa34c000)
libGLX.so.0 => /usr/lib/x86_64-linux-gnu/libGLX.so.0 (0x00007f13aa318000)
libnvrtc.so.10.1 => /home/pavel/cuda/lib64/libnvrtc.so.10.1 (0x00007f13a8ba8000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f13a8a59000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f13a8a51000)
/lib64/ld-linux-x86-64.so.2 (0x00007f13aadba000)
libGL.so.1 => /usr/lib/x86_64-linux-gnu/libGL.so.1 (0x00007f13a89c9000)
libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007f13a888c000)
libXi.so.6 => /usr/lib/x86_64-linux-gnu/libXi.so.6 (0x00007f13a887a000)
libXxf86vm.so.1 => /usr/lib/x86_64-linux-gnu/libXxf86vm.so.1 (0x00007f13a8873000)
libGLdispatch.so.0 => /usr/lib/x86_64-linux-gnu/libGLdispatch.so.0 (0x00007f13a87bb000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f13a8796000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f13a878b000)
libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007f13a8761000)
libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007f13a874c000)
libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007f13a8746000)
libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007f13a873c000)
libbsd.so.0 => /usr/lib/x86_64-linux-gnu/libbsd.so.0 (0x00007f13a8722000)

I start optixHello as follows:

./bin/optixHello --file ./optix_hello_output.ppm

I hope this information helps you replicate the problem. One more note: I connect to the server via SSH, and OpenGL rendering doesn’t work for me; that’s why I use direct output to PPM.
Please let me know if you need any other details.

Thank you for the help on porting to OptiX 7! Since OptiX 7 works perfectly well on the A100 GPU, we decided to go for it (in fact, that’s why it took me quite some time to reply to you).

Kind regards,
Pavel

Ok, you never mentioned that this was a multi-GPU related issue. That’s why system configuration information is crucial for all reported issues.
There are 64x8 blocks rendered and 64x56 blocks missing in your optix_sphere.ppm, which looks as if 7 of the 8 GPUs are not correctly read back.
I’ll add that information to the bug report.
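
Until that is resolved, a possible workaround sketch would be to restrict the OptiX context to a single device before any launch (the choice of ordinal 0 here is an assumption):

#include <vector>

// Limit the OptiX 6 context to one device so the multi-GPU readback path is
// not involved. The ordinals are OptiX device ordinals, not CUDA ordinals.
std::vector<int> devices = { 0 };
opx_context->setDevices(devices.begin(), devices.end());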
