Something wrong with riva quickstart

Tried on two different systems and riva quickstart keeps failing to launch.

Hi @ryein

Thanks for your interest in Riva

Can you please share with us the

  1. whether it is riva or riva-embedded
  2. config.sh used
  3. complete log output of bash riva_init.sh
  4. complete log output of bash riva_start.sh
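
One way to capture the complete output of both scripts is to pipe stdout and stderr through tee. This is only a sketch: run_and_log and the log file names are hypothetical and not part of the Riva quickstart.

```shell
# Hypothetical helper (not part of the Riva quickstart): run a command,
# show its output live, and save both stdout and stderr to a log file.
# Note: the exit status is that of tee, not of the command itself.
run_and_log() {
  log_file="$1"
  shift
  "$@" 2>&1 | tee "$log_file"
}

# Example usage from the quickstart directory (log names are arbitrary):
#   run_and_log riva_init.log bash riva_init.sh
#   run_and_log riva_start.log bash riva_start.sh
```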

Thanks

Not sure if it’s the same issue, but I also had trouble with:

  • riva_quickstart_v2.8.1
  • riva - not embedded
  • config.sh - unmodified/default

riva_init.sh was failing with:

....
To install the open-source samples corresponding to this TensorRT release version
run /opt/tensorrt/install_opensource.sh.  To build the open source parsers,
plugins, and samples for current top-of-tree on master or a different branch,
run /opt/tensorrt/install_opensource.sh -b <branch>
See https://github.com/NVIDIA/TensorRT for more information.
ERROR: No supported GPU(s) detected to run this container

Failed to detect NVIDIA driver version.

/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
TensorRT is not available! Will use ONNX backend instead.
2023-01-02 21:17:04,748 [INFO] Writing Riva model repository to '/data/models'...
2023-01-02 21:17:04,748 [INFO] The riva model repo target directory is /data/models
2023-01-02 21:17:06,252 [INFO] Using onnx runtime
2023-01-02 21:17:06,253 [INFO] Extract_binaries for language_model -> /data/models/riva-onnx-riva_text_classification_domain-nn-bert-base-uncased/1
2023-01-02 21:17:06,253 [INFO] extracting {'ckpt': ('nemo.collections.nlp.models.text_classification.text_classification_model.TextClassificationModel', 'model_weights.ckpt'), 'bert_config_file': ('nemo.collections.nlp.models.text_classification.text_classification_model.TextClassificationModel', 'bert-base-uncased_encoder_config.json')} -> /data/models/riva-onnx-riva_text_classification_domain-nn-bert-base-uncased/1
2023-01-02 21:17:07,806 [INFO] Printing copied artifacts:
2023-01-02 21:17:07,806 [INFO] {'ckpt': '/data/models/riva-onnx-riva_text_classification_domain-nn-bert-base-uncased/1/model_weights.ckpt', 'bert_config_file': '/data/models/riva-onnx-riva_text_classification_domain-nn-bert-base-uncased/1/bert-base-uncased_encoder_config.json'}
2023-01-02 21:17:07,806 [ERROR] Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/cli/deploy.py", line 100, in deploy_from_rmir
    generator.serialize_to_disk(
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/triton.py", line 445, in serialize_to_disk
    module.serialize_to_disk(repo_dir, rmir, config_only, verbose, overwrite)
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/triton.py", line 311, in serialize_to_disk
    self.update_binary(version_dir, rmir, verbose)
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/triton.py", line 757, in update_binary
    self.update_binary_from_copied(version_dir, rmir, copied, verbose)
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/triton.py", line 734, in update_binary_from_copied
    raise Exception("Need TRT and bert_config_file for ckpt model")
Exception: Need TRT and bert_config_file for ckpt model

+ '[' 1 -ne 0 ']'
+ echo 'Error in deploying RMIR models.'
Error in deploying RMIR models.
+ exit 1

What I did to get it to proceed was to modify the docker run calls in riva_init.sh to use “--privileged” anywhere the call also used --gpus (there were 2 such calls).
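
As a rough sketch of that edit, the sed filter below adds --privileged to any line that already contains --gpus and does not yet have --privileged. The function name add_privileged is made up for illustration, and it assumes both flags sit on the same line of the script. It writes to stdout so the result can be diffed before anything is replaced.

```shell
# Hypothetical helper: read a script on stdin and print it with
# --privileged inserted before --gpus on lines that lack --privileged.
# Lines without --gpus pass through unchanged.
add_privileged() {
  sed -e '/--privileged/!s/--gpus/--privileged --gpus/'
}

# Example usage (review the diff before overwriting anything):
#   add_privileged < riva_init.sh > riva_init.patched.sh
#   diff riva_init.sh riva_init.patched.sh
```

Keep in mind that --privileged gives the container broad access to the host, so this is probably best treated as a workaround rather than a fix.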

Context:

  • I’m running a fresh install of PopOS 22.04/Ubuntu 22.04
  • Docker version 20.10.22, build 3a2c30b
  • nvidia docker v2.11.0
  • I’m using a non-root user to access Docker, but that user is in the docker group

It now seems to be doing a whole lot of something… and I can see it’s using my GPU resources, so, that’s good. (not sure if I will run into issues with riva_start.sh yet, I have not gotten that far)

Hopefully that helps someone…

Follow-up on my previous post: I had to do the opposite in riva_start.sh, adding “--gpus all” to the docker command that only had “--privileged” in it…

but then it did all seem to work, and I was able to run examples.
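
To find the docker calls in riva_start.sh that need this change, a line-based search like the one below can help. It is a sketch only: find_privileged_without_gpus is a hypothetical name, and a docker run command split across lines with backslash continuations would need a manual check, since the two flags could sit on different lines.

```shell
# Hypothetical helper: print "line_number:text" for every line of a
# script that mentions --privileged but not --gpus.
find_privileged_without_gpus() {
  grep -n -e '--privileged' "$1" | grep -v -e '--gpus'
}

# Example usage:
#   find_privileged_without_gpus riva_start.sh
```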

Note: I’m on a 4090, which doesn’t have enough RAM to run all models at the same time either, so I also disabled the nlp and tts services at the top of config.sh, ran riva_clean.sh, and then re-initialized.
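
For anyone trying the same thing, the service toggles look roughly like the fragment below. The variable names are an assumption about what sits near the top of the quickstart config.sh, so check them against your own copy, since they may differ between releases.

```shell
# Assumed variable names; verify them against your own config.sh.
service_enabled_asr=true    # keep speech recognition
service_enabled_nlp=false   # disabled to save memory
service_enabled_tts=false   # disabled to save memory

# After editing config.sh, clean up and re-initialize:
#   bash riva_clean.sh
#   bash riva_init.sh
#   bash riva_start.sh
```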

Thanks for the info. I for sure need to play with it more. I’m sure it was user error.