Hello, I have been running the non-NIM setup of RIVA v2.16 and v2.17 since last October. The system runs fine and I can run all the sample scripts.
However, when I repeated the same setup on a new H100 HGX machine, riva_start.sh failed to load the models because of a missing model.plan file.
Note that I copied the previously downloaded models and RMIR files from the old machine and updated config.sh to use those offline files so they would not be downloaded again (rough excerpt of the changes below).
I had never seen model.plan before and am wondering when it was added. I also tried a fresh download, yet it still failed on the same missing model.plan file.
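For context, here is roughly what the offline changes in config.sh look like. The host path is a placeholder for my copied directory, and riva_model_loc / use_existing_rmirs are the quickstart variables as I understand them, so please correct me if the intended offline workflow is different:

# config.sh (quick_start) - approximate excerpt, host path is a placeholder
riva_model_loc="/data/riva-model-repo"   # contains the models/ and rmir/ folders copied from the old machine
use_existing_rmirs=true                  # deploy from the local RMIRs instead of downloading from NGC

# then the usual sequence
$ bash riva_init.sh    # as I understand it, this runs riva-deploy to build the Triton model repository
$ bash riva_start.sh   # fails with the missing model.plan error below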
Below is the log of the failed container:
$ docker logs -f cdd82fc02d6f
==========================
=== Riva Speech Skills ===
NVIDIA Release 24.06 (build 99025715)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
Riva waiting for Triton server to load all models...retrying in 1 second
I0121 21:26:57.786418 105 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x74a84e000000' with size 268435456
I0121 21:26:57.793758 105 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 1000000000
I0121 21:26:57.810813 105 model_lifecycle.cc:461] loading: conformer-en-US-asr-offline-asr-bls-ensemble:1
I0121 21:26:57.810908 105 model_lifecycle.cc:461] loading: conformer-en-US-asr-streaming-asr-bls-ensemble:1
I0121 21:26:57.810987 105 model_lifecycle.cc:461] loading: riva-onnx-fastpitch_encoder-English-US:1
I0121 21:26:57.811060 105 model_lifecycle.cc:461] loading: riva-punctuation-en-US:1
I0121 21:26:57.811130 105 model_lifecycle.cc:461] loading: riva-trt-conformer-en-US-asr-offline-am-streaming-offline:1
I0121 21:26:57.811194 105 model_lifecycle.cc:461] loading: riva-trt-conformer-en-US-asr-streaming-am-streaming:1
I0121 21:26:57.811253 105 model_lifecycle.cc:461] loading: riva-trt-hifigan-English-US:1
I0121 21:26:57.811310 105 model_lifecycle.cc:461] loading: riva-trt-riva-punctuation-en-US-nn-bert-base-uncased:1
I0121 21:26:57.811376 105 model_lifecycle.cc:461] loading: spectrogram_chunker-English-US:1
I0121 21:26:57.811457 105 model_lifecycle.cc:461] loading: tts_postprocessor-English-US:1
I0121 21:26:57.811529 105 model_lifecycle.cc:461] loading: tts_preprocessor-English-US:1
…
I0121 21:26:59.891518 105 pipeline_library.cc:28] TRITONBACKEND_ModelInstanceInitialize: riva-punctuation-en-US_0_0 (device 0)
I0121 21:26:59.914606 105 model_lifecycle.cc:818] successfully loaded 'riva-punctuation-en-US'
I0121 21:26:59.918637 105 tensorrt.cc:65] TRITONBACKEND_Initialize: tensorrt
I0121 21:26:59.918670 105 tensorrt.cc:75] Triton TRITONBACKEND API version: 1.16
I0121 21:26:59.918675 105 tensorrt.cc:81] 'tensorrt' TRITONBACKEND API version: 1.15
I0121 21:26:59.918680 105 tensorrt.cc:105] backend configuration:
{"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0121 21:26:59.923059 105 tensorrt.cc:222] TRITONBACKEND_ModelInitialize: riva-trt-conformer-en-US-asr-offline-am-streaming-offline (version 1)
I0121 21:26:59.925645 105 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: riva-trt-conformer-en-US-asr-offline-am-streaming-offline_0_0 (GPU device 0)
I0121 21:26:59.932020 105 tensorrt.cc:344] TRITONBACKEND_ModelInstanceFinalize: delete instance state
E0121 21:26:59.932037 105 backend_model.cc:635] ERROR: Failed to create instance: unable to find '/data/models/riva-trt-conformer-en-US-asr-offline-am-streaming-offline/1/model.plan' for model instance 'riva-trt-conformer-en-US-asr-offline-am-streaming-offline_0_0'
I0121 21:26:59.932060 105 tensorrt.cc:265] TRITONBACKEND_ModelFinalize: delete model state
E0121 21:26:59.932118 105 model_lifecycle.cc:621] failed to load 'riva-trt-conformer-en-US-asr-offline-am-streaming-offline' version 1: Unavailable: unable to find '/data/models/riva-trt-conformer-en-US-asr-offline-am-streaming-offline/1/model.plan' for model instance 'riva-trt-conformer-en-US-asr-offline-am-streaming-offline_0_0'
I0121 21:26:59.932133 105 model_lifecycle.cc:756] failed to load 'riva-trt-conformer-en-US-asr-offline-am-streaming-offline'
I0121 21:26:59.934808 105 tensorrt.cc:222] TRITONBACKEND_ModelInitialize: riva-trt-conformer-en-US-asr-streaming-am-streaming (version 1)
I0121 21:26:59.935180 105 tensorrt.cc:288] TRITONBACKEND_ModelInstanceInitialize: riva-trt-conformer-en-US-asr-streaming-am-streaming_0_0 (GPU device 0)
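In case it helps with diagnosing this: the path in the error maps back to the host through the riva_model_loc mount (the container sees it as /data), so the deployed repository can be inspected directly. The host path below is the placeholder from the config.sh excerpt above; adjust it to your actual location:

$ ls -l /data/riva-model-repo/models/riva-trt-conformer-en-US-asr-offline-am-streaming-offline/1/
$ ls -l /data/riva-model-repo/rmir/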
Please provide the following information when requesting support.
Hardware - GPU - H100 HGX
Hardware - CPU - AMD EPYC 9754 Genoa
Operating System - Ubuntu 22.04 LTS
Riva Version - 2.16
TLT Version (if relevant)
How to reproduce the issue ? (This is for errors. Please share the command and the detailed log here)