Clara Deploy SDK stuck at "Wait until TRTIS is Ready"

Hey there,

I’m having a hard time running the Clara Deploy SDK: it gets stuck at “Wait until TRTIS is Ready” while I follow the instructions at https://ngc.nvidia.com/catalog/containers/nvidia:clara:ai-covid-19. Could you please look into this and let me know what the problem is and how to resolve it? The console output is as follows:

ubuntu@ip-172-31-22-181:~$ ./run_docker.sh
Error response from daemon: network with name container-demo already exists
0538adc89e9e0961808bd6c687bb5fa79035ab9c7bf7e812ec5742acc99da427
nvidia-docker run --name trtis --network container-demo -d --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -v /home/ubuntu/models/classification_covid-19_v1:/models/classification_covid-19_v1 nvcr.io/nvidia/tensorrtserver:19.08-py3 trtserver --model-store=/models
Wait until TRTIS 172.18.0.2 is ready…

You can see in the output posted above that it got stuck while executing the following shell script from https://ngc.nvidia.com/catalog/containers/nvidia:clara:ai-covid-19:

./run_docker.sh

I ran the above right after setting up the reference pipeline using the following link: https://ngc.nvidia.com/catalog/model-scripts/nvidia:clara:clara_ai_covid19_pipeline

Could you at least let me know where exactly I went wrong? Did I do the steps in a different order, or fail to store the files in the appropriate directories? I’m kind of confused here!

Or do I have to do anything with the NVIDIA Triton Inference Server (Triton), formerly known as TensorRT Inference Server (TRTIS)?

I was also following the Clara Deploy SDK docs at https://docs.nvidia.com/clara/deploy/index.html while doing all of the above.

My guess, though I can’t say for sure, is that the problem lies with the NVIDIA Triton Inference Server (Triton), formerly known as TensorRT Inference Server (TRTIS). Please look into it and let me know as soon as possible.

Hey NVIDIA, why don’t you NVIDIA engineers get involved in the developer forum to support your developer/enterprise customers in the first place? I see a lot of negligence from you in this regard. If you don’t provide resolutions to issues on time, why have a developer forum at all? It’s a really disappointing, discouraging, and frustrating experience with the DevTalk forum, which has no point of contact for tech support. It’s a terrible experience working with NVIDIA products with such a bad support system in place. I’m ending here with all my frustration with the DevTalk forum. Goodbye!

Hello - when you see Triton hang like this, it is likely that an error has occurred in either your wrapped container or the Triton container itself. You can troubleshoot this by doing:

docker logs <container-name>
(see https://docs.docker.com/engine/reference/commandline/logs/)

You can get a list of containers by doing
docker container ls
(see https://docs.docker.com/engine/reference/commandline/container_ls/)

You’ll probably need to open a second terminal session (since the first one will be waiting for Triton to start).
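For example, in that second terminal (using the container name trtis from the run command posted above):

docker container ls   # confirm the trtis container is still running
docker logs trtis     # inspect the Triton server's startup output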

You should see what has caused it to stop processing (e.g., it could not find your model directory, an error in a reference name, etc.). What do you see when you check those logs?
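Also worth a quick sanity check on the host: make sure the model repository actually exists at the path mounted in the run command above, and that it has the layout TRTIS expects (a config.pbtxt plus a numeric version subdirectory per model). Something along these lines:

ls -R /home/ubuntu/models/classification_covid-19_v1
# Expect roughly:
#   config.pbtxt
#   1/<model files>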


I’ll follow up on @bgenereaux’s reply with more specific info.

The Triton (formerly TRTIS) instance did not start successfully. This is a known issue on the platform side, which requires (re)configuring the number of available GPUs regardless of how many physical GPUs are present. Please see this.

The COVID-19 reference pipeline consists of two AI inference operators (containers), and the platform tries to create two instances of the Triton Inference Server, each with its own GPU.

So, please change availableGpus to be greater than 1 (even if only one physical GPU is present), restart the platform, and try creating and running the pipeline again.
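For reference, here’s a sketch of the kind of change involved. I’m assuming the setting lives in the Clara Platform Helm chart’s values.yaml; the exact file path can differ between SDK versions, so locate the file in your own installation:

# Hypothetical location - find values.yaml in your Clara Platform Helm chart,
# e.g. somewhere under ~/.clara/charts/
availableGpus: 2   # set greater than 1, even with a single physical GPU

After saving the change, restart the platform and re-run the pipeline.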