Riva_server fails if not all Triton models are loaded

Hardware - GPU: A5000
Hardware - CPU: AMD
Operating System: Debian (riva-api docker image)
Riva Version: 2.7.0

I have many model configurations that I want to test and load/unload as needed and which won’t fit on a single GPU.
If my Triton server has multiple models available on disk but not all of them are loaded, riva_server fails with the following error:

E1222 18:45:51.512120   182 model_registry.cc:102] error: cannot get model config Request for unknown model: 'mymodelname-streaming' is not found

Here mymodelname is a placeholder for the actual model name.

If I instead load and then unload all the models, so that their state is at least set properly, I get:

error: cannot get model config Request for unknown model: 'mymodelname-streaming' version 1 is not at ready state

Is it not possible to use riva without loading all the models?

Hi @pineapple9011

Apologies for the delay.

Can you please share:

  1. the config.sh you used
  2. the complete docker logs riva-speech output

I will check further with the team on this request.

Thanks

I am not using the Riva quickstart setup (i.e. config.sh). I have my own setup that launches Riva and Triton inside the Docker container, since we build our models ourselves.
There is nothing special in the logs, no warnings or errors: just the typical Triton startup logs followed by Riva failing with the above error.

Example startup logs:

...
I0110 12:51:30.278211 111 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +327, now: CPU 2, GPU 2212 (MiB)
W0110 12:51:30.278233 111 logging.cc:46] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
I0110 12:51:30.722238 111 tensorrt.cc:1547] Created instance mymodelname-streaming-am-streaming_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0110 12:51:30.722291 111 endpointing_library.cc:22] TRITONBACKEND_ModelInstanceInitialize: mymodelname-streaming-endpointing-streaming_0 (device 0)
I0110 12:51:30.759757 111 model_lifecycle.cc:693] successfully loaded 'mymodelname-streaming-am-streaming' version 1
I0110 12:51:30.777475 111 ctc-decoder-library.cc:23] TRITONBACKEND_ModelInstanceInitialize: mymodelname-streaming-ctc-decoder-cpu-streaming_0 (device 0)
I0110 12:51:30.777575   129 ctc-decoder.cc:174] Beam Decoder initialized successfully!
I0110 12:51:30.777602 111 feature-extractor.cc:402] TRITONBACKEND_ModelInstanceInitialize: mymodelname-streaming-feature-extractor-streaming_0 (device 0)
I0110 12:51:30.777780 111 model_lifecycle.cc:693] successfully loaded 'mymodelname-streaming-endpointing-streaming' version 1
I0110 12:51:30.778117 111 model_lifecycle.cc:693] successfully loaded 'mymodelname-streaming-ctc-decoder-cpu-streaming' version 1
I0110 12:51:30.782237 111 model_lifecycle.cc:693] successfully loaded 'mymodelname-streaming-feature-extractor-streaming' version 1
I0110 12:51:30.782509 111 model_lifecycle.cc:459] loading: mymodelname-streaming:1
I0110 12:51:30.782726 111 model_lifecycle.cc:693] successfully loaded 'mymodelname-streaming' version 1
...

Hi @pineapple9011

Can you please share the detailed steps for how you launch Riva and Triton from the Docker container.

Please also share the complete docker logs riva-speech output.

First I start Triton:

tritonserver --disable-auto-complete-config \
  --strict-model-config=true \
  --model-repository /data/models \
  --model-control-mode=explicit \
  --cuda-memory-pool-byte-size=0:1000000000 &
triton_pid=$!
block_until_server_alive $triton_pid
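
block_until_server_alive is a small helper in my scripts; a minimal sketch of what it does, assuming Triton's default HTTP port 8000 and its standard /v2/health/ready readiness endpoint (the real helper may differ slightly):

block_until_server_alive() {
  local pid=$1
  # Poll Triton's readiness endpoint until it returns success.
  until curl -sf localhost:8000/v2/health/ready > /dev/null; do
    # Bail out if the server process has already exited.
    kill -0 "$pid" 2>/dev/null || { echo "tritonserver exited" >&2; return 1; }
    sleep 1
  done
}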

I then run a script (prepare_triton.py, shared on Pastebin) to load the models I want into Triton.

python3 prepare_triton.py --model_names "$MODEL_NAMES"

Here MODEL_NAMES is an environment variable containing a comma-delimited string of model names to load.
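
In essence the script issues an explicit load request per model name. A minimal shell equivalent, using Triton's model repository API over HTTP on the default port 8000 (the actual prepare_triton.py may differ):

IFS=',' read -ra names <<< "$MODEL_NAMES"
for name in "${names[@]}"; do
  # Explicit loads are only honored because tritonserver runs with
  # --model-control-mode=explicit.
  curl -sf -X POST "localhost:8000/v2/repository/models/${name}/load" \
    || { echo "failed to load ${name}" >&2; exit 1; }
done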

Then I run Riva:

riva_server --triton_uri=localhost:8001 "$@" &

This all shouldn’t matter though. I just need a yes/no answer to my original question: Is it possible to use Riva without loading all the models?

Thanks @pineapple9011

Apologies for the delay; I will check with the team on this and get back to you.

Thanks

Hi @pineapple9011

I have the following inputs from the internal team:

  1. Running Triton directly is not currently supported.
  2. The Riva server does not support --model-control-mode=explicit, which you are using. In Riva, --model-control-mode is NONE.
  3. riva_server will load everything in /data/models.
  4. To load only particular models, set up different Docker volumes for your configs.
    In config.sh:

    Models ($riva_model_loc/models)
    During the riva_init process, the RMIR files in $riva_model_loc/rmir are inspected and optimized for deployment. The optimized versions are stored in $riva_model_loc/models. The Riva server exclusively uses these optimized versions.

    riva_model_loc="riva-model-repo"

    riva_model_loc controls the name of the volume; if the value starts with a /, it is used as a host path.

Docker volumes will also help in tackling the memory-related challenge: with Docker volumes, you can divide your models however you prefer (see the sketch after the list below).

  5. The best advice is to follow the guidelines in the quickstart guide and use the Riva scripts provided.
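
For illustration only, a sketch of point 4 with placeholder names: keep each model set in its own volume (or host path) and point config.sh at the one you want before running the quickstart scripts.

# In one copy of config.sh: deploy only the models staged in this volume.
riva_model_loc="riva-model-repo-set-a"

# In another copy of config.sh: a value starting with / is used as a host path.
riva_model_loc="/data/riva/models-set-b"

Then run bash riva_init.sh and bash riva_start.sh with whichever config.sh is in place.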