Jarvis_start.sh times out

Hi,

Jarvis looks like a very interesting and useful project. However, I'm having trouble running it. I have an RTX 3090, and when I run jarvis_start.sh it times out waiting for Triton to start.

$ ./jarvis_start.sh
...
+ '[' 2 -ne 0 ']'
+ echo 'Waiting for Jarvis server to load all models...retrying in 10 seconds'
Waiting for Jarvis server to load all models...retrying in 10 seconds
+ sleep 10
+ echo 'Health ready check failed.'
Health ready check failed.
+ echo 'Check Jarvis logs with: docker logs jarvis-speech'
Check Jarvis logs with: docker logs jarvis-speech
+ exit 1

From the Docker logs it looks like Triton suddenly unloads all the models. The only things that look off are that tacotron2_ensemble is not found for some reason, and that there were some problems with the tacotron2_decoder_postnet header.

I manually deleted tacotron_* from /data/models and re-ran deploy_all_models /data/jmir /data/models, but it didn't help. I also tried running tritonserver with --strict-model-config=false; again, it didn't help.
(I didn't touch config.sh, so everything is at its defaults.)
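In case it helps to reproduce: the readiness check that jarvis_start.sh loops on can also be run by hand. A minimal sketch, assuming Triton's standard HTTP port 8000 is reachable from the host (triton_ready is just a hypothetical helper, not part of Jarvis):

```shell
# Hypothetical helper: interpret the HTTP status code returned by Triton's
# readiness endpoint. It returns 200 only once every model has loaded.
triton_ready() {
  if [ "$1" = "200" ]; then echo "ready"; else echo "not ready"; fi
}

# Usage against the running jarvis-speech container (assumes Triton's
# standard HTTP endpoint is exposed on localhost:8000):
#   code=$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v2/health/ready)
#   triton_ready "$code"
```

Watching `docker logs -f jarvis-speech` alongside that check shows which model Triton is on when it dies.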

E0418 17:27:42.601750 415 logging.cc:43] coreReadArchive.cpp (32) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
E0418 17:27:42.601784 415 logging.cc:43] INVALID_STATE: std::exception
E0418 17:27:42.601790 415 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
W0418 17:27:42.623235 415 autofill.cc:225] Autofiller failed to detect the platform for jarvis-trt-waveglow (verify contents of model directory or use --log-verbose=1 for more details)
W0418 17:27:42.623242 415 autofill.cc:248] Proceeding with simple config for now
E0418 17:27:42.623247 415 model_repository_manager.cc:1682] unexpected platform type  for jarvis-trt-waveglow
E0418 17:27:42.757081 415 logging.cc:43] coreReadArchive.cpp (32) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
E0418 17:27:42.757116 415 logging.cc:43] INVALID_STATE: std::exception
E0418 17:27:42.757122 415 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
E0418 17:27:42.872995 415 logging.cc:43] coreReadArchive.cpp (32) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
E0418 17:27:42.873026 415 logging.cc:43] INVALID_STATE: std::exception
E0418 17:27:42.873032 415 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
W0418 17:27:42.880388 415 autofill.cc:225] Autofiller failed to detect the platform for tacotron2_decoder_postnet (verify contents of model directory or use --log-verbose=1 for more details)
 ...
 ...
| Model                         | Version | Status                                 |
+-------------------------------+---------+----------------------------------------+
...
| tacotron2_decoder_postnet     | 1       | READY                                  |
| tacotron2_ensemble            | -       | Not loaded: No model version was found |
| tts_preprocessor              | 1       | READY                                  |
...
...

I0418 17:30:15.524788 56 server.cc:235] Timeout 30: Found 17 live models and 0 in-flight non-inference requests
   > Jarvis waiting for Triton server to load all models...retrying in 1 second
I0418 17:30:16.525058 56 server.cc:235] Timeout 29: Found 1 live models and 0 in-flight non-inference requests

Any help or tips would be appreciated!


Hi @artem5,
Could you please share the complete error log (console output) and system details (GPU type, Windows/Linux version, Docker version, etc.) so we can help better?

Thanks

I've been seeing similar output. Here is the output of docker logs jarvis-speech:

Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:06.496129 51 onnxruntime.cc:1728] TRITONBACKEND_Initialize: onnxruntime
I0421 16:34:06.496415 51 onnxruntime.cc:1738] Triton TRITONBACKEND API version: 1.0
I0421 16:34:06.496424 51 onnxruntime.cc:1744] 'onnxruntime' TRITONBACKEND API version: 1.0
Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:07.851614 51 pinned_memory_manager.cc:205] Pinned memory pool is created at '0x2033f0000' with size 268435456
I0421 16:34:07.852185 51 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 1000000000
E0421 16:34:07.869046 51 model_repository_manager.cc:1682] failed to open text file for read /data/models/jarvis-trt-jarvis_intent_weather-nn-bert-base-uncased/config.pbtxt: No such file or directory
E0421 16:34:07.872539 51 model_repository_manager.cc:1682] failed to open text file for read /data/models/jarvis_label_tokens_weather/config.pbtxt: No such file or directory
E0421 16:34:07.876931 51 model_repository_manager.cc:1160] Invalid argument: ensemble jarvis_intent_weather contains models that are not available: jarvis-trt-jarvis_intent_weather-nn-bert-base-uncased, jarvis_label_tokens_weather
I0421 16:34:07.877045 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_ner-nn-bert-base-uncased:1
I0421 16:34:07.977731 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_punctuation-nn-bert-base-uncased:1
I0421 16:34:08.078528 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_qa-nn-bert-base-uncased:1
I0421 16:34:08.179930 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_text_classification_domain-nn-bert-base-uncased:1
I0421 16:34:08.281365 51 model_repository_manager.cc:787] loading: jarvis-trt-jasper:1
I0421 16:34:08.382578 51 model_repository_manager.cc:787] loading: jarvis-trt-tacotron2_encoder:1
Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:08.484958 51 model_repository_manager.cc:787] loading: jarvis-trt-waveglow:1
I0421 16:34:08.587122 51 model_repository_manager.cc:787] loading: jarvis_detokenize:1
I0421 16:34:08.689370 51 model_repository_manager.cc:787] loading: jarvis_ner_label_tokens:1
I0421 16:34:08.693862 51 custom_backend.cc:198] Creating instance jarvis_detokenize_0_0_cpu on CPU using libtriton_jarvis_nlp_detokenizer.so
I0421 16:34:08.714426 51 model_repository_manager.cc:960] successfully loaded 'jarvis_detokenize' version 1
I0421 16:34:08.790664 51 model_repository_manager.cc:787] loading: jarvis_punctuation_gen_output:1
I0421 16:34:08.790749 51 custom_backend.cc:198] Creating instance jarvis_ner_label_tokens_0_0_cpu on CPU using libtriton_jarvis_nlp_seqlabel.so
I0421 16:34:08.800779 51 model_repository_manager.cc:960] successfully loaded 'jarvis_ner_label_tokens' version 1
I0421 16:34:08.891157 51 model_repository_manager.cc:787] loading: jarvis_punctuation_label_tokens_cap:1
I0421 16:34:08.891214 51 custom_backend.cc:198] Creating instance jarvis_punctuation_gen_output_0_0_cpu on CPU using libtriton_jarvis_nlp_punctuation.so
I0421 16:34:08.898853 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_gen_output' version 1
I0421 16:34:08.991919 51 model_repository_manager.cc:787] loading: jarvis_punctuation_label_tokens_punct:1
I0421 16:34:08.991972 51 custom_backend.cc:198] Creating instance jarvis_punctuation_label_tokens_cap_0_0_cpu on CPU using libtriton_jarvis_nlp_seqlabel.so
I0421 16:34:08.992487 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_label_tokens_cap' version 1
I0421 16:34:09.093418 51 model_repository_manager.cc:787] loading: jarvis_punctuation_merge_labels:1
I0421 16:34:09.093488 51 custom_backend.cc:198] Creating instance jarvis_punctuation_label_tokens_punct_0_0_cpu on CPU using libtriton_jarvis_nlp_seqlabel.so
I0421 16:34:09.094028 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_label_tokens_punct' version 1
I0421 16:34:09.193947 51 model_repository_manager.cc:787] loading: jarvis_qa_postprocessor:1
I0421 16:34:09.194125 51 custom_backend.cc:198] Creating instance jarvis_punctuation_merge_labels_0_0_cpu on CPU using libtriton_jarvis_nlp_labels.so
I0421 16:34:09.204057 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_merge_labels' version 1
I0421 16:34:09.294694 51 model_repository_manager.cc:787] loading: jarvis_qa_preprocessor:1
I0421 16:34:09.294842 51 custom_backend.cc:198] Creating instance jarvis_qa_postprocessor_0_0_cpu on CPU using libtriton_jarvis_nlp_qa.so
I0421 16:34:09.333444 51 model_repository_manager.cc:960] successfully loaded 'jarvis_qa_postprocessor' version 1
I0421 16:34:09.395174 51 model_repository_manager.cc:787] loading: jarvis_tokenizer:1
I0421 16:34:09.395389 51 custom_backend.cc:198] Creating instance jarvis_qa_preprocessor_0_0_cpu on CPU using libtriton_jarvis_nlp_tokenizer.so
Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:09.462604 51 model_repository_manager.cc:960] successfully loaded 'jarvis_qa_preprocessor' version 1
I0421 16:34:09.495644 51 model_repository_manager.cc:787] loading: jasper-asr-trt-ensemble-vad-streaming-ctc-decoder-cpu-streaming:1
I0421 16:34:09.495692 51 custom_backend.cc:198] Creating instance jarvis_tokenizer_0_0_cpu on CPU using libtriton_jarvis_nlp_tokenizer.so
I0421 16:34:09.518993 51 model_repository_manager.cc:960] successfully loaded 'jarvis_tokenizer' version 1
I0421 16:34:09.597203 51 model_repository_manager.cc:787] loading: jasper-asr-trt-ensemble-vad-streaming-feature-extractor-streaming:1
I0421 16:34:09.597267 51 custom_backend.cc:198] Creating instance jasper-asr-trt-ensemble-vad-streaming-ctc-decoder-cpu-streaming_0_0_cpu on CPU using libtriton_jarvis_asr_decoder_cpu.so
I0421 16:34:09.698796 51 model_repository_manager.cc:787] loading: jasper-asr-trt-ensemble-vad-streaming-offline-ctc-decoder-cpu-streaming-offline:1
I0421 16:34:09.698820 51 custom_backend.cc:201] Creating instance jasper-asr-trt-ensemble-vad-streaming-feature-extractor-streaming_0_0_gpu0 on GPU 0 (6.1) using libtriton_jarvis_asr_features.so
/opt/jarvis/bin/start-jarvis: line 4: 51 Segmentation fault tritonserver --log-verbose=0 --strict-model-config=true $model_repos --cuda-memory-pool-byte-size=0:1000000000
Triton server died before reaching ready state. Terminating Jarvis startup.
Check Triton logs with: docker logs
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]

Hi @amin3,

We are looking into it, will let you know in case if any updates.

Thanks

This error
/data/models/jarvis-trt-jarvis_intent_weather-nn-bert-base-uncased/config.pbtxt: No such file or directory

I was getting that same error. It happens because jarvis_init fails on some models. To fix it, I ran this command:
docker run --init -it --rm --gpus "device=0" -v jarvis-model-repo:/data --name jarvis-speech-maker nvcr.io/nvidia/jarvis/jarvis-speech:1.0.0-b.3-server /bin/bash
That gave me a shell inside the container; I went to /data/models and removed all the model directories that gave the config.pbtxt error.
(These are probably the models that jarvis_init failed to generate.)
Then I ran jarvis_init again, and kept repeating this until most of the models gave no error.
But I kept getting an error with jarvis-trt-waveglow that said
Exception: build_waveglow failed to generate waveglow.eng.
Otherwise, none of the others gave the config.pbtxt error anymore.
I also get a CUDA error, copytodevice: 2 (out of memory), when trying to start the Jarvis server, probably because I don't have enough GPU memory.
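The clean-up loop described above can be sketched roughly as follows. This is a hedged sketch, not an official procedure: the docker command is the one from this post, and prune_broken_models is a hypothetical helper I'm using to name the removal step.

```shell
# 1) Shell into the model volume (command from this post; adjust for your setup):
#    docker run --init -it --rm --gpus "device=0" -v jarvis-model-repo:/data \
#      --name jarvis-speech-maker \
#      nvcr.io/nvidia/jarvis/jarvis-speech:1.0.0-b.3-server /bin/bash
#
# 2) Inside the container, drop every model directory that is missing its
#    config.pbtxt -- those are the ones Triton rejects at startup.
prune_broken_models() {
  repo=$1
  for d in "$repo"/*/; do
    [ -f "${d}config.pbtxt" ] || { echo "removing $d"; rm -rf "$d"; }
  done
}
# Usage inside the container:
#   prune_broken_models /data/models
#
# 3) Back on the host, rerun jarvis_init.sh to regenerate the removed models,
#    and repeat until the startup errors stop.
```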

Hi @atomynosatom

Could you please try commenting out all the NLP models except one and see if that deploys successfully on your setup?
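A sketch of what that might look like in config.sh. The variable name and the model names below are assumptions for illustration; check your own config.sh for the exact entries in your Jarvis version:

```shell
# Hypothetical config.sh fragment: keep a single NLP model and comment out
# the rest (variable name and model names are assumptions -- match them to
# what your config.sh actually contains).
models_nlp=(
    "jarvis_punctuation"
#    "jarvis_intent_weather"
#    "jarvis_ner"
#    "jarvis_qa"
#    "jarvis_text_classification_domain"
)
```

With only one NLP model left in the list, jarvis_init and jarvis_start should need far less GPU memory, which helps isolate whether the failure is memory-related.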

Thanks

I have done that, but it still fails to deploy because of waveglow, and I still get the out-of-memory error. I am fairly sure it is because my GPU has only 8 GB of VRAM while the minimum is 16 GB.
