optixDeviceContextCreate() Segmentation Fault on Tesla T4 in AWS

This is an EC2 instance set up as g4dn.xlarge.

This is what nvidia-smi returns when run on that machine:

$ nvidia-smi
Fri Feb 24 18:37:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

deviceQuery seems to run just fine, as well.

On this particular machine, an app I created that uses OptiX crashes as reported above. On another machine, which is set up as g4dn.4xlarge, the error does not show up.

Since these instances can only be used through a command-line interface, I wasn’t able to compile or run the OptiX SDK examples.

I’m at a loss on why this would happen. Do you have any pointers on what to check to make this machine run not just my app, but any app with OptiX?

Hi @heriberto.delgado,

I’m not sure what might cause this, especially if you can get it working on another instance of the same type. It sounds like there probably is a difference in either the hardware or the installed driver or the OS. Is it easy to flash/reinstall the OS & driver? You can look for the OptiX components in the installed driver to make sure they are there, as well as use strace to verify it’s picking them up during initialization.
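One quick way to check that the driver's OptiX component is installed and loadable, without building anything, is to try loading it from a script. The sketch below assumes the shared object is installed under the soname `libnvoptix.so.1` and that it exports `optixQueryFunctionTable` (the entry point the OptiX SDK stubs look up); both names are assumptions worth confirming against your driver install. The function name `check_optix_driver_library` is made up for illustration.

```python
import ctypes


def check_optix_driver_library(name="libnvoptix.so.1"):
    """Return (ok, message) describing whether the OptiX driver library loads.

    `name` is an assumed soname; adjust it if your driver installs the
    library under a different file name.
    """
    try:
        # Uses the normal dynamic-loader search path, like the application would.
        lib = ctypes.CDLL(name)
    except OSError as exc:
        return False, f"could not load {name}: {exc}"

    # hasattr() triggers a symbol lookup in the loaded library.
    if not hasattr(lib, "optixQueryFunctionTable"):
        return False, f"{name} loaded but optixQueryFunctionTable is missing"
    return True, f"{name} loaded and exports optixQueryFunctionTable"


if __name__ == "__main__":
    ok, message = check_optix_driver_library()
    print("OK" if ok else "PROBLEM", "-", message)
```

If this reports a load failure on the crashing instance but succeeds on the working one, the driver install is the first thing to compare.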

Did you initialize your CUDA context successfully before initializing OptiX? You might double-check by synchronizing after CUDA initialization and/or using the CUDA context for something. Do CUDA SDK samples run on this machine? Your report from nvidia-smi shows the GPU having all available memory; is this true at OptiX init time too? Are you able to alloc & free memory before trying to init OptiX?
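As a standalone illustration of those checks, here is a minimal sketch that creates a CUDA context, queries free memory, allocates and frees a small buffer, and synchronizes. It goes through the CUDA driver API via `ctypes` rather than the runtime API an application would normally use, and it assumes the driver library is installed as `libcuda.so.1`. It runs outside your app, so it only tells you whether the machine can do these things at all, not whether your app's context is healthy at OptiX init time. The function name `cuda_sanity_check` is hypothetical.

```python
import ctypes


def cuda_sanity_check(nbytes=1 << 20, lib_name="libcuda.so.1"):
    """Create a context, query memory, alloc/free, sync; return (ok, message)."""
    try:
        cuda = ctypes.CDLL(lib_name)  # assumed soname of the CUDA driver library
    except OSError as exc:
        return False, f"could not load {lib_name}: {exc}"

    def call(step, result):
        # Driver API functions return a CUresult; 0 means CUDA_SUCCESS.
        if result != 0:
            raise RuntimeError(f"{step} failed with CUresult {result}")

    device = ctypes.c_int()
    context = ctypes.c_void_p()
    free_b, total_b = ctypes.c_size_t(), ctypes.c_size_t()
    dptr = ctypes.c_uint64()  # CUdeviceptr is 64-bit on 64-bit platforms
    try:
        call("cuInit", cuda.cuInit(0))
        call("cuDeviceGet", cuda.cuDeviceGet(ctypes.byref(device), 0))
        call("cuCtxCreate", cuda.cuCtxCreate_v2(ctypes.byref(context), 0, device))
        try:
            call("cuMemGetInfo",
                 cuda.cuMemGetInfo_v2(ctypes.byref(free_b), ctypes.byref(total_b)))
            call("cuMemAlloc",
                 cuda.cuMemAlloc_v2(ctypes.byref(dptr), ctypes.c_size_t(nbytes)))
            call("cuMemFree", cuda.cuMemFree_v2(dptr))
            call("cuCtxSynchronize", cuda.cuCtxSynchronize())
        finally:
            cuda.cuCtxDestroy_v2(context)
    except (RuntimeError, AttributeError) as exc:
        return False, str(exc)
    return True, f"context OK, {free_b.value} of {total_b.value} bytes free"


if __name__ == "__main__":
    ok, message = cuda_sanity_check()
    print("OK" if ok else "PROBLEM", "-", message)
```

Inside the application itself, the equivalent would be a small allocation, a free, and a synchronize placed right before the OptiX initialization calls.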

One minor note on OptiX SDK samples: at least a few of our samples will run from the command line without a display. First check out optixConsole, which is command-line only, and then take a look at samples that save the image to a file using the -f flag, like optixHello; these should also be runnable via the command line.


David.

Thanks for your response!

The app itself actually runs a CUDA kernel to gather some data required for the OptiX portion of it, so that’s how I know CUDA was initialized (and works) properly - in fact, the data gathering occurs immediately before the call to OptiX stuff.

I wasn’t aware that something like optixConsole existed in the SDK (I had the Windows version of the SDK at the time, and didn’t see it until I downloaded the Linux version). Very interesting. I’ll take a look to see what I can get from it.

As for the driver, the guys at Amazon provided instructions to install new drivers in the instance, which (I believe) I followed correctly. What are the names of the OptiX components in the driver that I should be looking for? This could actually be the answer to my issue.

The OptiX driver library is named libnvoptix, so you can grep your strace output to see where that’s being read from. Chances are you’ll see a list of locations it searches, and hopefully the last one will be where it’s found.
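If plain grep gets noisy, a small script can list each attempt to open the library and whether it succeeded. This is only a sketch: it assumes the trace was captured with something like `strace -f -e trace=open,openat -o trace.txt ./your_app` (where `./your_app` stands in for your binary) and that lines follow strace's usual `openat(..., "path", flags) = result` layout. The helper name `find_opens` is made up for illustration.

```python
import re
import sys

# Matches open()/openat() lines in strace output and captures:
#   1) the quoted path, 2) the numeric return value, 3) the errno name, if any.
OPEN_CALL = re.compile(
    r'\bopen(?:at)?\([^"]*"([^"]+)"[^)]*\)\s*=\s*(-?\d+)(?:\s+([A-Z]+))?'
)


def find_opens(strace_text, needle):
    """Return [(path, succeeded, errno_name_or_None)] for paths containing needle."""
    hits = []
    for line in strace_text.splitlines():
        match = OPEN_CALL.search(line)
        if match and needle in match.group(1):
            succeeded = int(match.group(2)) >= 0  # a negative result means failure
            hits.append((match.group(1), succeeded, match.group(3)))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as trace_file:
        for path, succeeded, errno_name in find_opens(trace_file.read(), "libnvoptix"):
            print("found " if succeeded else "missed", path, errno_name or "")
```

A run where every attempt is reported as missed would point at the driver install; a run where the last attempt is found means the library is being picked up and the problem lies elsewhere.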

Another remote possibility is that the OptiX cache fails to init, which could be due to filesystem permissions problems. Check your strace output for mention of optix7cache.db, and make sure OptiX was able to open it properly.
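To rule out the permissions angle directly, you can test whether the current user can create files in the cache directory. The sketch below is built on my understanding of the defaults, which you should verify against the OptiX documentation: the cache lives in `/var/tmp/OptixCache_<username>` on Linux, with `/tmp/OptixCache_<username>` as a fallback, and the `OPTIX_CACHE_PATH` environment variable overrides both. The helper names are hypothetical. Note that OptiX may create the directory on first use, so a missing directory on a fresh machine is not necessarily an error; a directory that exists but is not writable is the more suspicious case.

```python
import os
import tempfile


def candidate_cache_dirs(user, env=None):
    """Return directories where the OptiX cache is assumed to live for `user`."""
    env = os.environ if env is None else env
    override = env.get("OPTIX_CACHE_PATH")  # assumed override variable
    if override:
        return [override]
    # Assumed Linux defaults; confirm against the OptiX documentation.
    return [f"/var/tmp/OptixCache_{user}", f"/tmp/OptixCache_{user}"]


def check_cache_dir(path):
    """Return (ok, message) on whether a file can be created inside `path`."""
    if not os.path.isdir(path):
        return False, f"{path} does not exist or is not a directory"
    try:
        # Creating and removing a temporary file exercises write permission.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"cannot create files in {path}: {exc}"
    return True, f"{path} is writable"


if __name__ == "__main__":
    for directory in candidate_cache_dirs(os.environ.get("USER", "unknown")):
        ok, message = check_cache_dir(directory)
        print("OK" if ok else "PROBLEM", "-", message)
```

Run it as the same user that launches the application, since the directory name and its permissions are per-user.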


David.