Jarvis_start.sh times out

Hi,

Jarvis looks like a very interesting and useful project. However, I'm having trouble running it. I have an RTX 3090, and when I run jarvis_start.sh it times out waiting for Triton to start.

$ ./jarvis_start.sh
...
+ '[' 2 -ne 0 ']'
+ echo 'Waiting for Jarvis server to load all models...retrying in 10 seconds'
Waiting for Jarvis server to load all models...retrying in 10 seconds
+ sleep 10
+ echo 'Health ready check failed.'
Health ready check failed.
+ echo 'Check Jarvis logs with: docker logs jarvis-speech'
Check Jarvis logs with: docker logs jarvis-speech
+ exit 1

From the Docker logs it looks like Triton suddenly unloads all the models. The only things that look off are that tacotron2_ensemble is not found for some reason, and that there were some problems with the tacotron2_decoder_postnet header.

I manually deleted tacotron_* from /data/models and re-ran deploy_all_models /data/jmir /data/models, but it didn't help. I also tried running tritonserver with --strict-model-config=false; again, it didn't help.
(I didn't touch config.sh, so everything is at its defaults.)
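In case it helps to reproduce: the readiness check that jarvis_start.sh loops on can also be run by hand. A minimal sketch, assuming Triton's standard HTTP port 8000 is reachable from the host (triton_ready is just a hypothetical helper, not part of Jarvis):

```shell
# Hypothetical helper: interpret the HTTP status code returned by Triton's
# readiness endpoint. It returns 200 only once every model has loaded.
triton_ready() {
  if [ "$1" = "200" ]; then echo "ready"; else echo "not ready"; fi
}

# Usage against the running jarvis-speech container (assumes Triton's
# standard HTTP endpoint is exposed on localhost:8000):
#   code=$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v2/health/ready)
#   triton_ready "$code"
```

Watching `docker logs -f jarvis-speech` alongside that check shows which model Triton is on when it dies.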

E0418 17:27:42.601750 415 logging.cc:43] coreReadArchive.cpp (32) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
E0418 17:27:42.601784 415 logging.cc:43] INVALID_STATE: std::exception
E0418 17:27:42.601790 415 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
W0418 17:27:42.623235 415 autofill.cc:225] Autofiller failed to detect the platform for jarvis-trt-waveglow (verify contents of model directory or use --log-verbose=1 for more details)
W0418 17:27:42.623242 415 autofill.cc:248] Proceeding with simple config for now
E0418 17:27:42.623247 415 model_repository_manager.cc:1682] unexpected platform type  for jarvis-trt-waveglow
E0418 17:27:42.757081 415 logging.cc:43] coreReadArchive.cpp (32) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
E0418 17:27:42.757116 415 logging.cc:43] INVALID_STATE: std::exception
E0418 17:27:42.757122 415 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
E0418 17:27:42.872995 415 logging.cc:43] coreReadArchive.cpp (32) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
E0418 17:27:42.873026 415 logging.cc:43] INVALID_STATE: std::exception
E0418 17:27:42.873032 415 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
W0418 17:27:42.880388 415 autofill.cc:225] Autofiller failed to detect the platform for tacotron2_decoder_postnet (verify contents of model directory or use --log-verbose=1 for more details)
 ...
 ...
| Model                         | Version | Status                                 |
+-------------------------------+---------+----------------------------------------+
...
| tacotron2_decoder_postnet     | 1       | READY                                  |
| tacotron2_ensemble            | -       | Not loaded: No model version was found |
| tts_preprocessor              | 1       | READY                                  |
...
...

I0418 17:30:15.524788 56 server.cc:235] Timeout 30: Found 17 live models and 0 in-flight non-inference requests
   > Jarvis waiting for Triton server to load all models...retrying in 1 second
I0418 17:30:16.525058 56 server.cc:235] Timeout 29: Found 1 live models and 0 in-flight non-inference requests

Any help or tips would be appreciated!


Hi @artem5,
Could you please share the complete error log (console output) and system details (GPU type, Windows/Linux version, Docker version, etc.) so we can help better?

Thanks

I've been seeing similar output. Here is the output of docker logs jarvis-speech:

Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:06.496129 51 onnxruntime.cc:1728] TRITONBACKEND_Initialize: onnxruntime
I0421 16:34:06.496415 51 onnxruntime.cc:1738] Triton TRITONBACKEND API version: 1.0
I0421 16:34:06.496424 51 onnxruntime.cc:1744] 'onnxruntime' TRITONBACKEND API version: 1.0
Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:07.851614 51 pinned_memory_manager.cc:205] Pinned memory pool is created at '0x2033f0000' with size 268435456
I0421 16:34:07.852185 51 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 1000000000
E0421 16:34:07.869046 51 model_repository_manager.cc:1682] failed to open text file for read /data/models/jarvis-trt-jarvis_intent_weather-nn-bert-base-uncased/config.pbtxt: No such file or directory
E0421 16:34:07.872539 51 model_repository_manager.cc:1682] failed to open text file for read /data/models/jarvis_label_tokens_weather/config.pbtxt: No such file or directory
E0421 16:34:07.876931 51 model_repository_manager.cc:1160] Invalid argument: ensemble jarvis_intent_weather contains models that are not available: jarvis-trt-jarvis_intent_weather-nn-bert-base-uncased, jarvis_label_tokens_weather
I0421 16:34:07.877045 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_ner-nn-bert-base-uncased:1
I0421 16:34:07.977731 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_punctuation-nn-bert-base-uncased:1
I0421 16:34:08.078528 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_qa-nn-bert-base-uncased:1
I0421 16:34:08.179930 51 model_repository_manager.cc:787] loading: jarvis-trt-jarvis_text_classification_domain-nn-bert-base-uncased:1
I0421 16:34:08.281365 51 model_repository_manager.cc:787] loading: jarvis-trt-jasper:1
I0421 16:34:08.382578 51 model_repository_manager.cc:787] loading: jarvis-trt-tacotron2_encoder:1
Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:08.484958 51 model_repository_manager.cc:787] loading: jarvis-trt-waveglow:1
I0421 16:34:08.587122 51 model_repository_manager.cc:787] loading: jarvis_detokenize:1
I0421 16:34:08.689370 51 model_repository_manager.cc:787] loading: jarvis_ner_label_tokens:1
I0421 16:34:08.693862 51 custom_backend.cc:198] Creating instance jarvis_detokenize_0_0_cpu on CPU using libtriton_jarvis_nlp_detokenizer.so
I0421 16:34:08.714426 51 model_repository_manager.cc:960] successfully loaded 'jarvis_detokenize' version 1
I0421 16:34:08.790664 51 model_repository_manager.cc:787] loading: jarvis_punctuation_gen_output:1
I0421 16:34:08.790749 51 custom_backend.cc:198] Creating instance jarvis_ner_label_tokens_0_0_cpu on CPU using libtriton_jarvis_nlp_seqlabel.so
I0421 16:34:08.800779 51 model_repository_manager.cc:960] successfully loaded 'jarvis_ner_label_tokens' version 1
I0421 16:34:08.891157 51 model_repository_manager.cc:787] loading: jarvis_punctuation_label_tokens_cap:1
I0421 16:34:08.891214 51 custom_backend.cc:198] Creating instance jarvis_punctuation_gen_output_0_0_cpu on CPU using libtriton_jarvis_nlp_punctuation.so
I0421 16:34:08.898853 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_gen_output' version 1
I0421 16:34:08.991919 51 model_repository_manager.cc:787] loading: jarvis_punctuation_label_tokens_punct:1
I0421 16:34:08.991972 51 custom_backend.cc:198] Creating instance jarvis_punctuation_label_tokens_cap_0_0_cpu on CPU using libtriton_jarvis_nlp_seqlabel.so
I0421 16:34:08.992487 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_label_tokens_cap' version 1
I0421 16:34:09.093418 51 model_repository_manager.cc:787] loading: jarvis_punctuation_merge_labels:1
I0421 16:34:09.093488 51 custom_backend.cc:198] Creating instance jarvis_punctuation_label_tokens_punct_0_0_cpu on CPU using libtriton_jarvis_nlp_seqlabel.so
I0421 16:34:09.094028 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_label_tokens_punct' version 1
I0421 16:34:09.193947 51 model_repository_manager.cc:787] loading: jarvis_qa_postprocessor:1
I0421 16:34:09.194125 51 custom_backend.cc:198] Creating instance jarvis_punctuation_merge_labels_0_0_cpu on CPU using libtriton_jarvis_nlp_labels.so
I0421 16:34:09.204057 51 model_repository_manager.cc:960] successfully loaded 'jarvis_punctuation_merge_labels' version 1
I0421 16:34:09.294694 51 model_repository_manager.cc:787] loading: jarvis_qa_preprocessor:1
I0421 16:34:09.294842 51 custom_backend.cc:198] Creating instance jarvis_qa_postprocessor_0_0_cpu on CPU using libtriton_jarvis_nlp_qa.so
I0421 16:34:09.333444 51 model_repository_manager.cc:960] successfully loaded 'jarvis_qa_postprocessor' version 1
I0421 16:34:09.395174 51 model_repository_manager.cc:787] loading: jarvis_tokenizer:1
I0421 16:34:09.395389 51 custom_backend.cc:198] Creating instance jarvis_qa_preprocessor_0_0_cpu on CPU using libtriton_jarvis_nlp_tokenizer.so
Jarvis waiting for Triton server to load all models...retrying in 1 second
I0421 16:34:09.462604 51 model_repository_manager.cc:960] successfully loaded 'jarvis_qa_preprocessor' version 1
I0421 16:34:09.495644 51 model_repository_manager.cc:787] loading: jasper-asr-trt-ensemble-vad-streaming-ctc-decoder-cpu-streaming:1
I0421 16:34:09.495692 51 custom_backend.cc:198] Creating instance jarvis_tokenizer_0_0_cpu on CPU using libtriton_jarvis_nlp_tokenizer.so
I0421 16:34:09.518993 51 model_repository_manager.cc:960] successfully loaded 'jarvis_tokenizer' version 1
I0421 16:34:09.597203 51 model_repository_manager.cc:787] loading: jasper-asr-trt-ensemble-vad-streaming-feature-extractor-streaming:1
I0421 16:34:09.597267 51 custom_backend.cc:198] Creating instance jasper-asr-trt-ensemble-vad-streaming-ctc-decoder-cpu-streaming_0_0_cpu on CPU using libtriton_jarvis_asr_decoder_cpu.so
I0421 16:34:09.698796 51 model_repository_manager.cc:787] loading: jasper-asr-trt-ensemble-vad-streaming-offline-ctc-decoder-cpu-streaming-offline:1
I0421 16:34:09.698820 51 custom_backend.cc:201] Creating instance jasper-asr-trt-ensemble-vad-streaming-feature-extractor-streaming_0_0_gpu0 on GPU 0 (6.1) using libtriton_jarvis_asr_features.so
/opt/jarvis/bin/start-jarvis: line 4: 51 Segmentation fault tritonserver --log-verbose=0 --strict-model-config=true $model_repos --cuda-memory-pool-byte-size=0:1000000000
Triton server died before reaching ready state. Terminating Jarvis startup.
Check Triton logs with: docker logs
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]

Hi @amin3,

We are looking into it, will let you know in case if any updates.

Thanks

This error
/data/models/jarvis-trt-jarvis_intent_weather-nn-bert-base-uncased/config.pbtxt: No such file or directory

I was getting that same error. It happens because jarvis_init fails on some models. To fix it, I ran this command:
docker run --init -it --rm --gpus "device=0" -v jarvis-model-repo:/data --name jarvis-speech-maker nvcr.io/nvidia/jarvis/jarvis-speech:1.0.0-b.3-server /bin/bash
That gave me a shell inside the container; I went to /data/models and removed all the model directories that gave the config.pbtxt error.
(These are probably the models that jarvis_init failed to generate.)
Then I ran jarvis_init again, and kept repeating this until most of the models gave no error.
But I kept getting an error with jarvis-trt-waveglow that said
Exception: build_waveglow failed to generate waveglow.eng.
Otherwise, none of the others gave the config.pbtxt error anymore.
I also get a CUDA error, copytodevice: 2 (out of memory), when trying to start the Jarvis server, probably because I don't have enough GPU memory.
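The clean-up loop described above can be sketched roughly as follows. This is a hedged sketch, not an official procedure: the docker command is the one from this post, and prune_broken_models is a hypothetical helper I'm using to name the removal step.

```shell
# 1) Shell into the model volume (command from this post; adjust for your setup):
#    docker run --init -it --rm --gpus "device=0" -v jarvis-model-repo:/data \
#      --name jarvis-speech-maker \
#      nvcr.io/nvidia/jarvis/jarvis-speech:1.0.0-b.3-server /bin/bash
#
# 2) Inside the container, drop every model directory that is missing its
#    config.pbtxt -- those are the ones Triton rejects at startup.
prune_broken_models() {
  repo=$1
  for d in "$repo"/*/; do
    [ -f "${d}config.pbtxt" ] || { echo "removing $d"; rm -rf "$d"; }
  done
}
# Usage inside the container:
#   prune_broken_models /data/models
#
# 3) Back on the host, rerun jarvis_init.sh to regenerate the removed models,
#    and repeat until the startup errors stop.
```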

Hi @atomynosatom

Could you please try commenting out all the NLP models except one and see if that deploys successfully on your setup?
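A sketch of what that might look like in config.sh. The variable name and the model names below are assumptions for illustration; check your own config.sh for the exact entries in your Jarvis version:

```shell
# Hypothetical config.sh fragment: keep a single NLP model and comment out
# the rest (variable name and model names are assumptions -- match them to
# what your config.sh actually contains).
models_nlp=(
    "jarvis_punctuation"
#    "jarvis_intent_weather"
#    "jarvis_ner"
#    "jarvis_qa"
#    "jarvis_text_classification_domain"
)
```

With only one NLP model left in the list, jarvis_init and jarvis_start should need far less GPU memory, which helps isolate whether the failure is memory-related.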

Thanks

I have done that, but it still fails to deploy because of waveglow, and I still get the out-of-memory error. I am fairly sure it is because my GPU has only 8 GB of VRAM while the minimum is 16 GB.
