Morpheus cloud deployment "unable to poll from model repository"

Hello,

I am working on deploying Morpheus in the cloud (AWS EKS) using the Cloud Deployment Guide, but I am getting the following error when trying to create an MLflow deployment with the provided models. I am following the guide exactly as written.

Any help or guidance on how to proceed here would be much appreciated!

Steps followed - Morpheus Cloud Deployment Guide - NVIDIA Docs

Environment -

  • Morpheus 23.07
  • Kubernetes 1.27
  • EC2 Instance - g4dn.2xlarge
# kubectl -n $NAMESPACE get all
NAME                             READY   STATUS    RESTARTS   AGE
pod/ai-engine-b5dd6fc65-lgbmp    1/1     Running   0          22h
pod/broker-8555cf8678-s4pdq      1/1     Running   0          22h
pod/mlflow-9dc885464-4gggd       1/1     Running   0          22h
pod/sdk-cli-helper               1/1     Running   0          139m
pod/zookeeper-6458958bc9-lb2jq   1/1     Running   0          22h

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/ai-engine         ClusterIP   172.20.194.146   <none>        8000/TCP,8001/TCP,8002/TCP   22h
service/broker            ClusterIP   172.20.219.166   <none>        9092/TCP                     22h
service/broker-external   NodePort    172.20.118.240   <none>        9092:30092/TCP               22h
service/mlflow            NodePort    172.20.162.197   <none>        5000:30500/TCP               22h
service/notebook          NodePort    172.20.36.124    <none>        8888:30888/TCP               139m
service/zookeeper         ClusterIP   172.20.114.237   <none>        2181/TCP                     22h

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ai-engine   1/1     1            1           22h
deployment.apps/broker      1/1     1            1           22h
deployment.apps/mlflow      1/1     1            1           22h
deployment.apps/zookeeper   1/1     1            1           22h

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/ai-engine-b5dd6fc65    1         1         1       22h
replicaset.apps/broker-8555cf8678      1         1         1       22h
replicaset.apps/mlflow-9dc885464       1         1         1       22h
replicaset.apps/zookeeper-6458958bc9   1         1         1       22h

First, I try to publish the model as described in the guide, and this exits without any error -

(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# python publish_model_to_mlflow.py \
>       --model_name sid-minibert-onnx \
>       --model_directory /common/models/triton-model-repo/sid-minibert-onnx \
>       --flavor triton
Registered model 'sid-minibert-onnx' already exists. Creating a new version of this model...
Created version '2' of model 'sid-minibert-onnx'.
/mlflow/artifacts/0/8687820bbc674584b5e125093ea80c22/artifacts
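
As a quick sanity check, the artifact path printed on the last line can be listed from the same pod to confirm the model files (and config.pbtxt) were actually logged:

ls -laR /mlflow/artifacts/0/8687820bbc674584b5e125093ea80c22/artifacts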

The publish step appears to work, and I can see the model in the MLflow UI, but then I get this error when creating the deployment -

(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# mlflow deployments create -t triton       --flavor triton       --name sid-minibert-onnx       -m models:/sid-minibert-onnx/1       -C "version=1"
Saved mlflow-meta.json to /common/triton-model-repo/sid-minibert-onnx
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 110, in create_deployment
    self.triton_client.load_model(name)
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_client.py", line 663, in load_model
    _raise_if_error(response)
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_utils.py", line 69, in _raise_if_error
    raise error
tritonclient.utils.InferenceServerException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/bin/mlflow", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow/deployments/cli.py", line 146, in create_deployment
    deployment = client.create_deployment(name, model_uri, flavor, config=config_dict)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 112, in create_deployment
    raise MlflowException(str(ex))
mlflow.exceptions.MlflowException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository

This is the error on the ai-engine pod -

E0201 18:15:56.857982 1 model_repository_manager.cc:1306] Poll failed for model directory 'sid-minibert-onnx': Invalid model name: Could not determine backend for model 'sid-minibert-onnx' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
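
(That line comes from the Triton container's log; if you want to pull it in your own environment, something like the following should surface it - the deployment name is the one from the guide:)

kubectl -n $NAMESPACE logs deployment/ai-engine | grep -i 'poll failed'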

While digging into this, I noticed that the version directory "1" within the Triton model repository (/common/triton-model-repo) is empty - both from the mlflow pod and from the ai-engine pod -

(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# ls -laR /common/triton-model-repo/sid-minibert-onnx/
/common/triton-model-repo/sid-minibert-onnx/:
total 20
drwxr-xr-x 3 root root 4096 Feb  1 16:22 .
drwxr-xr-x 4 root root 4096 Feb  1 17:47 ..
drwxr-xr-x 2 root root 4096 Feb  1 16:22 1
-rw-r--r-- 1 root root  186 Feb  1 18:15 mlflow-meta.json
-rw-r--r-- 1 root root   49 Feb  1 18:15 registered_model_meta

/common/triton-model-repo/sid-minibert-onnx/1:
total 8
drwxr-xr-x 2 root root 4096 Feb  1 16:22 .
drwxr-xr-x 3 root root 4096 Feb  1 16:22 ..

root@ai-engine-b5dd6fc65-lgbmp:/opt/tritonserver# ls -laR /common/triton-model-repo/sid-minibert-onnx/
/common/triton-model-repo/sid-minibert-onnx/:
total 20
drwxr-xr-x 3 root root 4096 Feb  1 16:22 .
drwxr-xr-x 4 root root 4096 Feb  1 17:47 ..
drwxr-xr-x 2 root root 4096 Feb  1 16:22 1
-rw-r--r-- 1 root root  186 Feb  1 18:15 mlflow-meta.json
-rw-r--r-- 1 root root   49 Feb  1 18:15 registered_model_meta

/common/triton-model-repo/sid-minibert-onnx/1:
total 8
drwxr-xr-x 2 root root 4096 Feb  1 16:22 .
drwxr-xr-x 3 root root 4096 Feb  1 16:22 ..
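
For reference, a loadable model entry in a Triton repository needs a config.pbtxt (which names the backend) plus a populated version directory - this is what the source copy under /common/models/triton-model-repo should contain (a sketch, assuming the stock Morpheus model repo):

/common/models/triton-model-repo/sid-minibert-onnx/
├── config.pbtxt      (names the backend, e.g. platform: "onnxruntime_onnx")
└── 1/
    └── model.onnx    (the actual model file)

With no config.pbtxt and an empty version directory, Triton has no way to determine the backend, which matches the "Could not determine backend" error above.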

This is a bug in the mlflow-triton plugin's deployment code that should be addressed in the next release. Please note that it cannot be consistently reproduced in all environments.

In some environments, the plugin does not copy over the config.pbtxt that Triton requires for the model.

- Pete


Thanks for the pointer, Pete.

It happens consistently in my environment, even after an uninstall/reinstall and after rotating the k8s node. Please let me know if you'd like any specific information about my environment that would help you assess the root cause or impact.

For those looking for a workaround, you can manually copy the files to /common/triton-model-repo and then create the deployment again.

cp -r /common/models/triton-model-repo/sid-minibert-onnx/ /common/triton-model-repo/

For example -

(mlflow) root@mlflow-9dc885464-5hj58:/mlflow# mlflow deployments create -t triton \
>       --flavor triton \
>       --name sid-minibert-onnx \
>       -m models:/sid-minibert-onnx/1 \
>       -C "version=1"
Saved mlflow-meta.json to /common/triton-model-repo/sid-minibert-onnx
Traceback (most recent call last):
  .... SNIP ....
tritonclient.utils.InferenceServerException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository


(mlflow) root@mlflow-9dc885464-5hj58:/mlflow# cp -r /common/models/triton-model-repo/sid-minibert-onnx/ /common/triton-model-repo/

(mlflow) root@mlflow-9dc885464-5hj58:/mlflow# mlflow deployments create -t triton       --flavor triton       --name sid-minibert-onnx       -m models:/sid-minibert-onnx/1       -C "version=1"
Saved mlflow-meta.json to /common/triton-model-repo/sid-minibert-onnx
triton deployment sid-minibert-onnx is created
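
(To confirm Triton really loaded the model, you can query the deployment back through the plugin, or hit Triton's HTTP ready endpoint on the ai-engine service directly - the service name and port are the ones from the guide:)

mlflow deployments get -t triton --name sid-minibert-onnx
curl -s -o /dev/null -w '%{http_code}\n' http://ai-engine:8000/v2/models/sid-minibert-onnx/ready    # expect 200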
