Riva Speech Server Fails to Start Due to Model Loading Errors

I encountered several issues while running the assessment.ipynb file in the NVIDIA DLI course “Building Conversational AI Applications”, specifically during the setup and execution of Riva Speech Skills. Below are the details of the problems:

  1. Model Loading Failure:
  • The Triton Inference Server fails to load models and terminates unexpectedly during the initialization phase.
  • Error messages from docker logs include:
error: creating server: Internal - failed to load all models
> Triton server died before reaching ready state. Terminating Riva startup.
  • Additional errors indicate missing configuration files (config.pbtxt) for certain models:
Poll failed for model directory '1': failed to open text file for read /data/models/1/config.pbtxt: No such file or directory
  2. Excessive Number of Models:
  • The /data/models directory contains an excessive number of models, including ASR, TTS, and NLP-related models.
  • This appears to prolong the model loading process, leading to timeout issues.
  3. Timeout Issues:
  • Riva Speech Skills waits for the models to load but fails due to a timeout:
Timeout 29: Found 4 live models and 0 in-flight non-inference requests
  • Despite increasing the timeout value, the server still fails to initialize all models successfully.
  4. Docker Environment Configuration:
  • Potential misconfigurations in Docker container resource allocation (e.g., memory, GPU usage) could also be contributing to the problem.
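
For triage, the log messages quoted above can be pulled out of a saved copy of the container log automatically. The following is a minimal sketch, assuming the log has already been captured as text; the function name `summarize_log` and the marker strings are illustrative choices taken from the messages in this post, not part of Riva or Triton.

```python
import re

# Pattern based on the "Poll failed" message quoted in this post.
MISSING_FILE = re.compile(
    r"failed to open text file for read (\S+): No such file or directory"
)

# Substrings of the fatal messages quoted in this post (assumed to be stable).
FATAL_MARKERS = (
    "failed to load all models",
    "Triton server died before reaching ready state",
)


def summarize_log(log_text):
    """Collect missing-file paths and fatal lines from captured log text."""
    missing = []
    fatal = []
    for line in log_text.splitlines():
        match = MISSING_FILE.search(line)
        if match and match.group(1) not in missing:
            missing.append(match.group(1))
        if any(marker in line for marker in FATAL_MARKERS):
            fatal.append(line.strip())
    return {"missing_files": missing, "fatal_lines": fatal}
```

Calling `summarize_log` on the captured log text returns the distinct paths that could not be opened and the lines that indicate a fatal startup failure, which makes it easier to see which model directories to inspect first.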

Steps Taken:

  • Verified the contents of the /data/models directory using docker exec and confirmed that some models are missing critical files like config.pbtxt.
  • Attempted to reduce the number of models by only keeping those relevant to ASR, but the server still fails to start.
  • Edited the riva_start.sh script to extend the timeout period but encountered the same issue.
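
The directory check described above can be scripted. As far as I understand Triton's model repository layout, each top-level directory is treated as a model and is expected to hold a config.pbtxt plus numbered version subdirectories, so a top-level directory named '1' (as in the error above) may be a version directory that ended up one level too high. The sketch below is only an illustration under that assumption; `check_model_repo` is a hypothetical helper name, and the repository path is whatever location you pass in.

```python
from pathlib import Path


def check_model_repo(repo_root):
    """Return (missing_config, numeric_named) lists of top-level directory names.

    missing_config: directories without a config.pbtxt file.
    numeric_named:  directories whose name is all digits, which may be
                    misplaced version directories (an assumption, not a rule).
    """
    root = Path(repo_root)
    missing_config = []
    numeric_named = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue  # only directories are treated as models
        if not (entry / "config.pbtxt").is_file():
            missing_config.append(entry.name)
        if entry.name.isdigit():
            numeric_named.append(entry.name)
    return missing_config, numeric_named


# Example usage (the path is the one from the error message in this post):
# missing, numeric = check_model_repo("/data/models")
# print("No config.pbtxt:", missing)
# print("Numeric top-level names:", numeric)
```

The script has to run where the model repository is visible, for example inside the container or against a copy of the directory.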

Request for Assistance:

  1. What is the recommended way to handle the excessive number of models? Is there a list of essential models required for basic ASR functionality?
  2. How can I ensure all necessary model files (e.g., config.pbtxt) are present and properly configured?
  3. Are there additional changes needed in the riva_start.sh script or Docker configuration to resolve this issue?
  4. Could there be compatibility issues between the Triton Inference Server and Riva Speech Skills, given the current setup?

Any guidance or suggestions to resolve these issues would be greatly appreciated. Thank you!

Hi @jepetolee, thanks for sharing the detailed issues here. I’m reaching out to the course owner, who will get back to you. Thanks for your patience.

Can you please run riva_clean.sh and try again?
Basic ASR functionality does not need the NMT and TTS models, so you can disable them in config.sh.

@jepetolee I’m sorry you are having difficulties running the assessment. I suspect your issue is caused by model-loading mismatches happening in the background, which are unique to this course. Here are some tips to hopefully get you across the finish line!

  1. If you are spinning up the course and jumping to the assessment on a cold start, in addition to setting up your NGC key again in notebook 3, it is safest to wait for all the background data loads and Docker image loads to finish. Getting to this state takes 18-20 minutes, but jumping in early may have unpredictable results. You can check the status by looking in dli_workspace. When everything is loaded, there should be no .tar or .tgz files remaining. During the original in-person delivery of the course, this background data load was complete by the time it was needed due to lectures and so on, so you may not have been aware it was occurring.
  2. Correctly setting up config.sh in step 1 is critical. Pay attention to the hint: “# Check your work - are all three services enabled? Is the model location repo correct?”
  3. Proceed through the assessment steps without skipping anything. Pay attention to the instructions, FIXME sections, and “Check your work” hints.
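
The readiness check from tip 1 (no .tar or .tgz files left in dli_workspace) can also be done from a notebook cell. This is a minimal sketch: `pending_archives` is a hypothetical helper name, the workspace location is whatever path you pass in, and searching subdirectories as well is my assumption rather than something the course specifies.

```python
from pathlib import Path

# Archive types named in tip 1.
ARCHIVE_SUFFIXES = (".tar", ".tgz")


def pending_archives(workspace):
    """Return the archive files still present under the workspace (recursive)."""
    root = Path(workspace)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in ARCHIVE_SUFFIXES
    )


# Example usage (replace the path with the actual location of dli_workspace):
# leftovers = pending_archives("dli_workspace")
# print("Still loading:" if leftovers else "Background load looks complete.")
# for path in leftovers:
#     print(" ", path)
```

An empty result suggests the background load has finished; any remaining archive means it is safer to wait before starting the assessment.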

If you follow the tips above, you should not get the errors you reported. I just went through it myself and had no errors.