Has anyone run Riva Speech Skills in Ubuntu on WSL2?

Hardware - GPU (RTX5000)
Hardware - CPU (Xeon W-10855M)
Operating System (Windows 11 - Version 10.0.22449 Build 22449)
Riva Version (v1.4.0-beta)
TLT Version (n/a)
How to reproduce the issue:
Follow the CUDA on WSL instructions and the quick start instructions provided in NGC for launching Riva in Ubuntu. The riva_init.sh script completes, but the riva_start.sh script fails immediately with the error:

/opt/riva/bin/start-riva: line 4: 43 Segmentation fault

Note - I am able to run the sample CUDA container in the same Ubuntu distribution:

docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

Where would I begin to debug the riva_start.sh script to allow running in a WSL2 Ubuntu distribution as I have attempted above?

The main point:
I (and many others?) have Windows enterprise machines that do not have Ubuntu installed in a dual-boot configuration. Without Linux, it is very difficult to test the Riva framework, so I’m trying to build a bridge with WSL2.

Thanks!


Hi @cameron.willis ,
We are currently checking on this. Please allow us some time.
Thanks!

Hi,
I’m getting the same error and am looking into whether I can find out more.
These are the logs I got: segfault.txt (28.2 KB)

First I looked at the NVIDIA Container Toolkit log: nct_dbg.txt (12.7 KB)
Then I went into the container to look at the Triton server. Running it directly from a bash shell with its required arguments reproduced the segfault. Under gdb I got this trace: trace.txt (6.2 KB)
It seems to crash in a strlen-avx2 inlined into Metrics::UUIDForCudaDevice. UUIDForCudaDevice itself looked fine to me, as long as everything returned by the functions it calls is valid. The first obvious use of a string is in the error output for the case where the device is not valid. If the char* returned by the DCGM errorString function is invalid, then piping it into the output stream would try to get the length of the C string and segfault.
I didn’t go over each individual error code, but what caught my attention is that there are 50 case statements in the errorString function, while dcgmReturn_enum has 52 elements (-52 to 0, with -2 missing). So there are two error codes that return a null pointer instead of a valid error string. I’m not sure this is the cause, since I don’t have the pdbs and don’t get line numbers in the trace.
@AakankshaS could you check this?
The other question then would be why the device information is not available…


OK, I just checked the codes. Indeed, the culprits seem to be these two:
DCGM_ST_INSTANCE_NOT_FOUND = -45, //!< The specified GPU instance does not exist
DCGM_ST_COMPUTE_INSTANCE_NOT_FOUND = -46, //!< The specified GPU compute instance does not exist
So now the question is: why is the GPU not found?
I’m running Windows 11 and installed this driver. I used the default Ubuntu 20.04 LTS distribution that comes with the first install of WSL2, and installed Docker and the nvidia-container-toolkit inside Ubuntu running on WSL2. I copied the files for the quick start tutorial to my home directory and ended up with the same issue as @cameron.willis