Hardware - GPU (RTX5000)
Hardware - CPU (Xeon W-10855M)
Operating System (Windows 11 - Version 10.0.22449 Build 22449)
Riva Version (v1.4.0-beta)
TLT Version (n/a)
How to reproduce the issue:
Follow the CUDA on WSL instructions and the quick start instructions provided in NGC for launching Riva in Ubuntu. The riva_init.sh script completes, but the riva_start.sh script fails immediately with the error:
/opt/riva/bin/start-riva: line 4: 43 Segmentation fault
Note - I am able to run the sample CUDA container in the same Ubuntu distribution:
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Where would I begin to debug the riva_start.sh script to allow running in a WSL2 Ubuntu distribution as I have attempted above?
The main point:
I - and many others? - have Windows enterprise machines that do not have Ubuntu installed in a dual-boot setup. Without Linux, it is very difficult to test the Riva framework, so I'm trying to build a bridge with WSL2.
First I looked into the NVIDIA Container Toolkit log: nct_dbg.txt (12.7 KB)
Then I decided to go into the container and have a look at the Triton server. Running it directly through a bash shell with its required arguments gave me the segfault. Under gdb I got this backtrace: trace.txt (6.2 KB)
It seems to crash in strlen-avx2, inlined into Metrics::UUIDForCudaDevice. UUIDForCudaDevice looked fine to me, as long as everything returned by the functions it calls is valid. The first obvious use of a string is in the error output for the case where the device is not valid. If the char* returned by the DCGM errorString function is invalid, then piping it into the output stream would try to get the length of the C string and segfault.
I didn't go over each individual error code, but what made me curious is that there are 50 case statements in the errorString function, while dcgmReturn_enum has 52 elements (-52 to 0, with -2 missing). So there are two error codes that return a null pointer instead of a valid error string. I'm not sure if this is the cause, since I don't have the debug symbols and don't get line numbers in the trace. @AakankshaS could you check this?
The other question then would be why the device information is not available…
OK, I just checked the codes. Indeed, the culprits seem to be these two:
DCGM_ST_INSTANCE_NOT_FOUND = -45 //!< The specified GPU instance does not exist
DCGM_ST_COMPUTE_INSTANCE_NOT_FOUND = -46 //!< The specified GPU compute instance does not exist
So now the question is: why is the GPU not found?
I'm running Windows 11 and installed this driver. I used the default Ubuntu 20.04 LTS distribution that comes with a first install of WSL2, and installed Docker and the nvidia-container-toolkit inside Ubuntu running on WSL2. I copied the files for the quick start tutorial to my home directory and ended up with the same issue as @cameron.willis.
This setting is only valid before you run riva_init.sh! If you cannot run it successfully, you should run 'docker volume rm riva-model-repo' and try again.
Now I can run ‘Riva Text to Speech Integration Example with Streaming Audio Player in Omniverse Audio2Face’.