Hardware - GPU A10
Question: How do I run only one model using the Riva quickstart? I want to run the en-GB model without the en-US model.
Details:
I can successfully start Riva ASR with the en-US model. I have been trying for a couple of days to start Riva with the en-GB model and have not been able to accomplish this.
I am using the riva quickstart repo 2.12.1 (I have also tried 2.11.0).
Steps I am taking:
- download the quickstart repo
- edit config.sh to include:
service_enabled_asr=true
service_enabled_nlp=false
service_enabled_tts=false
service_enabled_nmt=false
language_code=("en-GB")
I do not edit anything else.
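As a sanity check that my edit is at least syntactically what config.sh expects (language_code is a bash array there), here is a standalone sketch of the same mechanism, separate from the real config.sh:

```shell
# Standalone sketch of the bash array syntax that config.sh uses for
# language selection; not the real config file, just the same mechanism.
language_code=("en-GB")
echo "count=${#language_code[@]} first=${language_code[0]}"
# prints: count=1 first=en-GB
```

So a one-element array selecting only en-GB should be valid syntax; the question is why the en-US models still get deployed.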
bash riva_init.sh
→ This completes successfully.
bash riva_start.sh
→ This fails.
The command runs for many minutes. Here are the interesting parts from the docker logs riva-speech output:
I0802 13:31:12.401523 102 model_lifecycle.cc:693] successfully loaded 'conformer-en-GB-asr-offline-ctc-decoder-cpu-streaming-offline' version 1
I0802 13:31:12.502004 102 model_lifecycle.cc:693] successfully loaded 'conformer-en-GB-asr-streaming-endpointing-streaming' version 1
I0802 13:31:13.412781 102 endpointing_library.cc:20] TRITONBACKEND_ModelInitialize: conformer-en-US-asr-offline-endpointing-streaming-offline (version 1)
I0802 13:31:13.516893 102 model_lifecycle.cc:693] successfully loaded 'conformer-en-GB-asr-offline-endpointing-streaming-offline' version 1
I0802 13:31:20.525963 102 pipeline_library.cc:28] TRITONBACKEND_ModelInstanceInitialize: riva-punctuation-en-GB_0 (device 0)
I0802 13:31:20.526448 102 model_lifecycle.cc:693] successfully loaded 'conformer-en-US-asr-streaming-ctc-decoder-cpu-streaming' version 1
I0802 13:31:20.540559 102 pipeline_library.cc:28] TRITONBACKEND_ModelInstanceInitialize: riva-punctuation-en-US_0 (device 0)
I0802 13:31:20.540669 102 model_lifecycle.cc:693] successfully loaded 'riva-punctuation-en-GB' version 1
I0802 13:31:20.554729 102 feature-extractor.cc:417] TRITONBACKEND_ModelInstanceInitialize: conformer-en-US-asr-streaming-feature-extractor-streaming_0 (device 0)
I0802 13:31:20.554864 102 model_lifecycle.cc:693] successfully loaded 'riva-punctuation-en-US' version 1
I0802 13:31:20.561134 102 feature-extractor.cc:417] TRITONBACKEND_ModelInstanceInitialize: conformer-en-GB-asr-offline-feature-extractor-streaming-offline_0 (device 0)
I0802 13:31:20.562079 102 model_lifecycle.cc:693] successfully loaded 'conformer-en-US-asr-streaming-feature-extractor-streaming' version 1
I0802 13:31:20.592794 102 feature-extractor.cc:417] TRITONBACKEND_ModelInstanceInitialize: conformer-en-GB-asr-streaming-feature-extractor-streaming_0 (device 0)
I0802 13:31:20.593456 102 model_lifecycle.cc:693] successfully loaded 'conformer-en-GB-asr-offline-feature-extractor-streaming-offline' version 1
I0802 13:31:20.599374 102 tensorrt.cc:5627] TRITONBACKEND_ModelInstanceInitialize: riva-trt-conformer-en-GB-asr-offline-am-streaming-offline_0 (GPU device 0)
I0802 13:31:20.599982 102 model_lifecycle.cc:693] successfully loaded 'conformer-en-GB-asr-streaming-feature-extractor-streaming' version 1
> Riva waiting for Triton server to load all models...retrying in 1 second
I0802 13:31:21.243123 102 logging.cc:49] Loaded engine size: 353 MiB
I0802 13:31:21.507173 102 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5000, GPU 18976 (MiB)
I0802 13:31:21.508489 102 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +331, now: CPU 0, GPU 331 (MiB)
I0802 13:31:21.541547 102 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4293, GPU 18976 (MiB)
> Riva waiting for Triton server to load all models...retrying in 1 second
I0802 13:31:22.909362 102 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +333, now: CPU 0, GPU 664 (MiB)
W0802 13:31:22.909394 102 logging.cc:46] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
> Riva waiting for Triton server to load all models...retrying in 1 second
I0802 13:31:23.103711 102 tensorrt.cc:1547] Created instance riva-trt-conformer-en-GB-asr-offline-am-streaming-offline_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0802 13:31:25.986923 102 tensorrt.cc:1547] Created instance riva-trt-conformer-en-GB-asr-streaming-am-streaming_0 on GPU 0 with stream priority 0 and optimization profile default[0];
E0802 13:31:27.654518 102 logging.cc:43] 1: [graphContext.h::MyelinGraphContext::27] Error Code 1: Myelin (CUDA error 2 failed to create CUDA stream )
E0802 13:31:27.654751 102 model_lifecycle.cc:596] failed to load 'riva-trt-conformer-en-US-asr-offline-am-streaming-offline' version 1: Internal: unable to create TensorRT context
W0802 13:31:29.059793 102 logging.cc:46] Requested amount of GPU memory (707424256 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
Ah, now that I have properly read the log, I can see I am running out of memory.
I see it is loading the en-US model as well; how can I prevent that?
The en-US model by itself fits in my 24GB of memory, but I only have 16GB free because I need to run other processes on the GPU in parallel.
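In case others hit this: my current suspicion is that the en-US engines are leftovers in the model repository from my earlier en-US run, since riva_init.sh adds models for the configured languages but, as far as I can tell, does not remove previously deployed ones. The sequence I plan to try next (riva_clean.sh ships with the quickstart; read its prompts carefully, as it deletes the downloaded models and the model repo volume):

```shell
cd riva_quickstart_v2.12.1

# Wipe the old deployment, including the en-US engines left over
# from the earlier run (riva_clean.sh asks for confirmation first).
bash riva_clean.sh

# Re-deploy with only the en-GB settings in config.sh,
# then start the server again.
bash riva_init.sh
bash riva_start.sh
```

Separately, the warning in the log about CUDA lazy loading (CUDA_MODULE_LOADING) might reduce device memory usage somewhat, but it would not stop the en-US models from being loaded in the first place.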