Optix 7.5 memory access problem

OptiX 7 applications compiled for release mode targets should never need to set the OPTIX_FORCE_DEPRECATED_LAUNCHER environment variable.

As an additional data point, I am also seeing illegal memory access errors from CUDA if I don’t set the OPTIX_FORCE_DEPRECATED_LAUNCHER variable. The odd thing is I see it in both our debug and release builds. In the quoted statement, what would be your definition of a release build? For example, is it the optimization level, the debug info settings, a preprocessor define, or something else? I’m trying to narrow down why we are seeing these crashes in release as well. They only started happening when we moved from the 525 drivers to 535.

My setup is:
OS - RHEL 7.9
CUDA 11.4 (possibly too old?)
Optix 7.7
drivers 535.86.05
GPU - RTX A6000

Thanks!
Mark

I am also seeing illegal memory access errors from CUDA if I don’t set the OPTIX_FORCE_DEPRECATED_LAUNCHER variable.

That should not happen and we would require a minimal and complete reproducer to be able to investigate what happens in your case.

Do all OptiX SDK 7.7.0 examples built as release and debug targets run on your configuration (without setting the OPTIX_FORCE_DEPRECATED_LAUNCHER environment variable)?

It shouldn’t matter too much with which CUDA Toolkit version you generated your OptiX device program input as long as the resulting PTX code is targeting SM 5.0 (Maxwell) to handle all supported GPU generations with one input source.
Though CUDA 11.4 doesn’t support the new OptiX IR input format. That requires CUDA 11.7 or higher.
I used CUDA 11.8 for quite some time with no issues and currently use CUDA 12.1 also with no problems so far.

what would be your definition of a release build?

I’m building my fixed OptiX device program code with these NVCC command line options for release and debug targets.
(Note that these do not contain the --device-debug (-G) option.)
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/MDL_renderer/CMakeLists.txt#L214
and I use these OptixModuleCompileOptions
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/MDL_renderer/src/Device.cpp#L347
and these OptixPipelineCompileOptions:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/MDL_renderer/src/Device.cpp#L347
That is my definition of a release build.
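
To make the point about --device-debug (-G) concrete, here is a minimal sketch of a sanity check a build script or test could run over the NVCC argument list. The function name hasDeviceDebug is hypothetical and not part of OptiX or the linked example; only the two spellings of the flag come from this thread.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns true if the NVCC argument list enables
// full device debug information, i.e. contains -G or --device-debug.
bool hasDeviceDebug(const std::vector<std::string>& nvccArgs)
{
    for (const std::string& arg : nvccArgs)
    {
        if (arg == "-G" || arg == "--device-debug")
        {
            return true;
        }
    }
    return false;
}
```

A build that is meant to match the "release" definition above could fail early when hasDeviceDebug(args) returns true. The check only looks at exact flag spellings and does not parse combined or response-file arguments.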

I only enable OptiX exceptions and some more checks when setting the compile time define USE_DEBUG_EXCEPTIONS. That is usually not necessary unless something is really going wrong. (I actually just found an issue with the MDL thin_film modifier with that, generating negative values.)

That way the renderer remains at full speed when debugging the host code.

I just tested enabling --device-debug and the USE_DEBUG_EXCEPTIONS define (otherwise OptiX 7.7.0 complains that the module compile options aren’t matching full debug settings) with that MDL_renderer example under Windows 10 running 535.98 drivers on an RTX 6000 Ada and had no issues with any configuration.
This was all without the OPTIX_FORCE_DEPRECATED_LAUNCHER environment variable. The only time I set that is when using Nsight Compute on currently released R535 drivers.
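
Since the behavior differs depending on whether the variable is set, it can help to record its state in the application log or in a bug report. A minimal sketch, assuming a hypothetical helper named describeEnvVar that only reads the process environment and does not touch CUDA or OptiX:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns a one-line description of an environment
// variable, suitable for a startup log line or a bug report.
// The default name is the variable discussed in this thread.
std::string describeEnvVar(const char* name = "OPTIX_FORCE_DEPRECATED_LAUNCHER")
{
    const char* value = std::getenv(name);
    if (value == nullptr)
    {
        return std::string(name) + " is not set";
    }
    return std::string(name) + "=" + value;
}
```

Calling describeEnvVar() once at startup, before any OptiX initialization, makes it clear from the log which launcher path a given run was using.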

I cannot rule out issues in your Linux display drivers or your application, especially if you’re not reproducing this with the OptiX SDK examples, so we’d need a reproducer.

Thanks for the detailed response.

I verified we’re building with equivalent options to what you suggested.

Simple geometry only code paths work fine in release but when we turn on functionality like lighting and texturing (without setting OPTIX_FORCE_DEPRECATED_LAUNCHER) we crash immediately. This makes it not too trivial to create a compact reproducer. I’m not ruling out it being a latent bug in our code which only became apparent when switching to the 535 drivers. It’s hard to determine what’s happening since I’m not getting any printf or assert output from OptiX when the crash occurs.

I don’t have quick access to the SDK samples on my machine but will look into testing with them.

Ultimately it’s not too pressing an issue for us right now since we can set the OPTIX_FORCE_DEPRECATED_LAUNCHER variable and everything works as expected.

Yeah, CUDA illegal address errors can be tricky. Unfortunately, capturing these inside OptiX device code with the Nsight VSE or cuda-gdb debuggers requires the OPTIX_FORCE_DEPRECATED_LAUNCHER environment variable and OptiX IR input at this time.

Ultimately it’s not too pressing an issue for us right now since we can set the OPTIX_FORCE_DEPRECATED_LAUNCHER variable and everything works as expected.

This will potentially not fix itself if we’re not able to reproduce your specific case.

Simple geometry only code paths work fine in release but when we turn on functionality like lighting and texturing (without setting OPTIX_FORCE_DEPRECATED_LAUNCHER) we crash immediately. This makes it not too trivial to create a compact reproducer.

We can also work with full application reproducers, if that is an option. We would just need an installer and minimal reproducer scene along with detailed step-by-step repro description assuming a clean OS system setup and zero experience with your application.

I’m looking into the possibility of sending a reproducer at the moment. There are some clearances I need to get on my end… stay tuned.