Clara Deploy SDK stuck at "Wait until TRTIS is Ready"

Hey there,

I’m having a hard time running the Clara Deploy SDK: it gets stuck at “Wait until TRTIS is Ready”, as shown below, while following the instructions at https://ngc.nvidia.com/catalog/containers/nvidia:clara:ai-covid-19. I kindly request you to look into it and let me know what the problem is and how to resolve it. The console output is as follows:

ubuntu@ip-172-31-22-181:~$ ./run_docker.sh
Error response from daemon: network with name container-demo already exists
0538adc89e9e0961808bd6c687bb5fa79035ab9c7bf7e812ec5742acc99da427
nvidia-docker run --name trtis --network container-demo -d --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -v /home/ubuntu/models/classification_covid-19_v1:/models/classification_covid-19_v1 nvcr.io/nvidia/tensorrtserver:19.08-py3 trtserver --model-store=/models
Wait until TRTIS 172.18.0.2 is ready…

You may observe in the output posted above that it got stuck while executing the following shell script from https://ngc.nvidia.com/catalog/containers/nvidia:clara:ai-covid-19:

./run_docker.sh

I ran the above right after setting up the reference pipeline using the following link: https://ngc.nvidia.com/catalog/model-scripts/nvidia:clara:clara_ai_covid19_pipeline

Could you at least let me know where exactly I went wrong? Did I do things in a different order? Or did I not store the files in the appropriate directories? I’m kinda confused here!

Or do I have to do anything with the NVIDIA Triton Inference Server (Triton), formerly known as TensorRT Inference Server (TRTIS)?

I was also following the Clara Deploy SDK docs at https://docs.nvidia.com/clara/deploy/index.html while doing the above.

My guess, though I don’t know for sure, is that the problem is with the NVIDIA Triton Inference Server (Triton), formerly known as TensorRT Inference Server (TRTIS). Kindly look into it and let me know ASAP.

Hey NVIDIA, why don’t you NVIDIA engineers get involved in the developer forum to support your developer/enterprise customers in the first place? I see a lot of negligence from you guys in this regard. When you don’t provide resolutions to issues on time, why have such a developer forum in the first place? It’s a really disappointing, discouraging, and frustrating experience with the DevTalk forum, which has no point of contact for tech support. It’s a terrible experience working with NVIDIA products under such a bad support system. I’m ending here with all my frustration with the DevTalk forum. Goodbye!

Hello - when you see Triton hang like this, it is likely that an error has occurred in either your wrapped container or the Triton container itself. You can troubleshoot this by doing:

docker logs <container-name>
(see https://docs.docker.com/engine/reference/commandline/logs/)

You can get a list of containers by doing
docker container ls
(see https://docs.docker.com/engine/reference/commandline/container_ls/)

You’ll probably need to open a second terminal session (since the first one will be waiting for Triton to start).

You should see what has caused it to stop processing (e.g., it could not find your model directory, there is an error in the name of a reference, etc.). What do you see when you check those logs?
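
For example, in that second terminal (the container name trtis comes from the run command in the output above; the readiness URL is an assumption based on the TRTIS 19.08 HTTP API):

docker container ls    # list running containers and their names
docker logs trtis      # dump the TRTIS server log
docker logs -f trtis   # or follow the log live while the server starts
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/api/health/ready    # 200 means the server is actually ready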

I’ll follow up on @bgenereaux’s reply with more specific info.

It was the Triton (formerly TRTIS) instance that did not start successfully, and this is a known issue on the platform side which requires (re)configuration of the available GPUs regardless of how many physical GPUs are present. Please see this.

The COVID-19 reference pipeline consists of two AI inference operators (containers), and the Platform tries to create two instances of the Triton Inference Server each with a GPU.

So, please change availableGpus to be greater than 1 (even if only one physical GPU is present). Restart the platform and try creating and running the pipeline again.
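
A minimal sketch of that change (the chart path below is an assumption; adjust it to wherever your Clara Platform values.yaml lives, and restart the platform using your usual procedure):

# edit the platform Helm chart values (path is an assumption):
sed -i 's/availableGpus:.*/availableGpus: 2/' ~/.clara/charts/clara/values.yaml
grep availableGpus ~/.clara/charts/clara/values.yaml    # verify the new value
# then restart the platform and re-create/run the pipeline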

Hello Ming,

So what should I do to change the available GPUs when I actually have only one GPU? How could I reconfigure the available GPUs?

Hi jingyq1,

Thanks for the question. Can you please specify which version of Clara Deploy you are using now?

The thread is 5 months old, and Clara Deploy has been improving along the way. A feature will be available soon to ensure a pipeline with multiple GPU requests can be executed on a single-GPU host.

In the meantime, the need to specify availableGpus has been removed. If your operators do not need to use a GPU, then there is no need to specify a GPU request, and if you do need to use the Triton Inference Server, one GPU is sufficient (assuming the GPU is of a supported architecture with sufficient GPU memory).

Regards,
Ming

Hi Ming,

The Clara CLI version: 0.7.1-12788.ae65aea0
The Clara Platform version: 0.7.1-12788.ae65aea0

I just installed the Clara Deploy last week, so I believe it is the latest version for now.

Because it is an emergency research project, it may be better for me to know how to change the available GPU number for the current version. By the way, may I know when the new feature will be available? If it will be available within one week, then it will work for me.

Thank you for your time!

Best,
Jingyuan

Per the release notes:

Platform Server now automatically detects the number and type of GPU available in the cluster. Starting with release v0.7.1 Platform Server no longer honors the availableGpus configuration option in the values.yaml Helm chart.

So your single-GPU system should allow you to run some pipelines. If in doubt, please attach the pipeline definition file, and I can help to review it.

I am not sure which file is the pipeline definition file or where I could find it. I would like to use the “Clara Deploy AI COVID-19 Classification Operator”, for which the script should download a model called “segmentation_ct_lung_v1”. But when I run the script shown online, the program just gets stuck and shows “wait until TRTIS is ready”.

This operator, along with a few others, is in the COVID-19 Classification Pipeline. The overview, setup guide, etc. are on the NGC page.
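
For reference, a quick way to spot the pipeline definition (assuming you extracted the NGC package to ~/clara_ai_covid19_pipeline; that path is an assumption):

ls ~/clara_ai_covid19_pipeline/*.yaml    # the YAML file listing the operators, their container images, and GPU requests is the pipeline definition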

The COVID-19 Classification Operator itself takes two inputs: the lung CT volume image (converted from a CT Chest DICOM series) and the lung segmentation volume image (output from the ai-lung segmentation operator). So to run the operator(s) standalone manually, you need to run ai-lung first with the downloaded model and input data, then the COVID-19 Classification operator (see the sketch below).
The ai-lung operator will cause the Triton Inference Server to use over 8GB of GPU memory.
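
A rough sketch of that standalone sequence (the script names below are placeholders, not the actual script names; use the run scripts documented on each operator’s NGC page):

# 1) segment the lungs from the converted CT volume (hypothetical script name):
./run_ai_lung.sh                   # input: CT volume; output: lung segmentation
# 2) classify, feeding in both the CT volume and the segmentation output:
./run_covid19_classification.sh    # inputs: CT volume + lung segmentation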

Actually, the hang happened when I tried to run the segmentation part. I tried to run the “Clara Deploy AI Lung Segmentation Operator”, but I could not find the model “segmentation_lung_v1” in the MODELS section on NGC. The script provided on NGC says “If blank, all available models will be loaded.” One possible model is “clara_train_covid19_ct_lung_seg”, but the name is not the same as what the segmentation operator requires. Could you direct me to the page where I could download the model?

Please download the models (both the ct_lung segmentation and COVID-19 classification models) from the aforementioned link, https://ngc.nvidia.com/resources/nvidia:clara:clara_ai_covid19_pipeline

The statement about loading all models is in the context that all the models (folders, etc.) have been made available in the repository (a file system in this case) of the Triton Inference Server.
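
In other words, the directory you mount as the model store should end up looking roughly like this (the standard TRTIS/Triton model-repository layout; exact version folders and model file names depend on the model format):

/models
├── classification_covid-19_v1
│   ├── config.pbtxt
│   └── 1/    # version directory holding the model file(s)
└── segmentation_ct_lung_v1
    ├── config.pbtxt
    └── 1/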

Hi Ming,

I have found the models, Thank you.

But when I ran the program today, I got another problem. I tried reinstalling the SDK, but it did not help.

When I try to run “clara pull platform”, it returns “Error: Looks like “https://helm.ngc.nvidia.com/nvidia/clara” is not a valid chart repository or cannot be reached: Failed to fetch https://helm.ngc.nvidia.com/nvidia/clara/index.yaml : 401 Unauthorized”. I am pretty sure that my API key and username, generated by the NGC setup, are correct.

Sorry for the inconvenience.

The issue with pulling Clara Platform has been resolved. Please try again.

A combination of fixes was applied to resolve this:
1) Public read access granted to `nvidia/clara`
2) Authenticated users will gain access to all public repos now
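
If you still see a 401 after this, you can sanity-check your credentials against the repo directly (a sketch; the $oauthtoken username convention is per the NGC documentation, and NGC_API_KEY stands for your generated API key):

curl -s -o /dev/null -w "%{http_code}\n" -u '$oauthtoken':"$NGC_API_KEY" https://helm.ngc.nvidia.com/nvidia/clara/index.yaml    # 200 = credentials accepted, 401 = rejected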