Morpheus cloud deployment "unable to poll from model repository"

Hello,

I am working on deploying Morpheus in the cloud (AWS EKS) using the Cloud Deployment Guide, but I am getting the following error when trying to create an MLflow deployment with the provided models. I am following the guide exactly as written.

Any help or guidance on how to proceed here would be much appreciated!

Steps followed - Morpheus Cloud Deployment Guide - NVIDIA Docs

Environment -

  • Morpheus 23.07
  • Kubernetes 1.27
  • EC2 Instance - g4dn.2xlarge
# kubectl -n $NAMESPACE get all
NAME                             READY   STATUS    RESTARTS   AGE
pod/ai-engine-b5dd6fc65-lgbmp    1/1     Running   0          22h
pod/broker-8555cf8678-s4pdq      1/1     Running   0          22h
pod/mlflow-9dc885464-4gggd       1/1     Running   0          22h
pod/sdk-cli-helper               1/1     Running   0          139m
pod/zookeeper-6458958bc9-lb2jq   1/1     Running   0          22h

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/ai-engine         ClusterIP   172.20.194.146   <none>        8000/TCP,8001/TCP,8002/TCP   22h
service/broker            ClusterIP   172.20.219.166   <none>        9092/TCP                     22h
service/broker-external   NodePort    172.20.118.240   <none>        9092:30092/TCP               22h
service/mlflow            NodePort    172.20.162.197   <none>        5000:30500/TCP               22h
service/notebook          NodePort    172.20.36.124    <none>        8888:30888/TCP               139m
service/zookeeper         ClusterIP   172.20.114.237   <none>        2181/TCP                     22h

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ai-engine   1/1     1            1           22h
deployment.apps/broker      1/1     1            1           22h
deployment.apps/mlflow      1/1     1            1           22h
deployment.apps/zookeeper   1/1     1            1           22h

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/ai-engine-b5dd6fc65    1         1         1       22h
replicaset.apps/broker-8555cf8678      1         1         1       22h
replicaset.apps/mlflow-9dc885464       1         1         1       22h
replicaset.apps/zookeeper-6458958bc9   1         1         1       22h

First, I try to publish the model as described in the guide, and this exits without any error -

(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# python publish_model_to_mlflow.py \
>       --model_name sid-minibert-onnx \
>       --model_directory /common/models/triton-model-repo/sid-minibert-onnx \
>       --flavor triton
Registered model 'sid-minibert-onnx' already exists. Creating a new version of this model...
Created version '2' of model 'sid-minibert-onnx'.
/mlflow/artifacts/0/8687820bbc674584b5e125093ea80c22/artifacts
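
As a quick sanity check, the artifact path printed on the last line can be listed from the same pod to confirm the model files (and config.pbtxt) were actually logged:

ls -laR /mlflow/artifacts/0/8687820bbc674584b5e125093ea80c22/artifacts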

The publish step appears to work, and I can see the model in the MLflow UI, but then I get this error when creating the deployment -

(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# mlflow deployments create -t triton       --flavor triton       --name sid-minibert-onnx       -m models:/sid-minibert-onnx/1       -C "version=1"
Saved mlflow-meta.json to /common/triton-model-repo/sid-minibert-onnx
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 110, in create_deployment
    self.triton_client.load_model(name)
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_client.py", line 663, in load_model
    _raise_if_error(response)
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_utils.py", line 69, in _raise_if_error
    raise error
tritonclient.utils.InferenceServerException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/bin/mlflow", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow/deployments/cli.py", line 146, in create_deployment
    deployment = client.create_deployment(name, model_uri, flavor, config=config_dict)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 112, in create_deployment
    raise MlflowException(str(ex))
mlflow.exceptions.MlflowException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository

This is the error on the ai-engine pod -

E0201 18:15:56.857982 1 model_repository_manager.cc:1306] Poll failed for model directory 'sid-minibert-onnx': Invalid model name: Could not determine backend for model 'sid-minibert-onnx' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
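
(That line comes from the Triton container's log; if you want to pull it in your own environment, something like the following should surface it - the deployment name is the one from the guide:)

kubectl -n $NAMESPACE logs deployment/ai-engine | grep -i 'poll failed'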

While digging into this, I noticed that the version directory "1" within the Triton model repository (/common/triton-model-repo) is empty - both from the mlflow pod and from the ai-engine pod -

(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# ls -laR /common/triton-model-repo/sid-minibert-onnx/
/common/triton-model-repo/sid-minibert-onnx/:
total 20
drwxr-xr-x 3 root root 4096 Feb  1 16:22 .
drwxr-xr-x 4 root root 4096 Feb  1 17:47 ..
drwxr-xr-x 2 root root 4096 Feb  1 16:22 1
-rw-r--r-- 1 root root  186 Feb  1 18:15 mlflow-meta.json
-rw-r--r-- 1 root root   49 Feb  1 18:15 registered_model_meta

/common/triton-model-repo/sid-minibert-onnx/1:
total 8
drwxr-xr-x 2 root root 4096 Feb  1 16:22 .
drwxr-xr-x 3 root root 4096 Feb  1 16:22 ..

root@ai-engine-b5dd6fc65-lgbmp:/opt/tritonserver# ls -laR /common/triton-model-repo/sid-minibert-onnx/
/common/triton-model-repo/sid-minibert-onnx/:
total 20
drwxr-xr-x 3 root root 4096 Feb  1 16:22 .
drwxr-xr-x 4 root root 4096 Feb  1 17:47 ..
drwxr-xr-x 2 root root 4096 Feb  1 16:22 1
-rw-r--r-- 1 root root  186 Feb  1 18:15 mlflow-meta.json
-rw-r--r-- 1 root root   49 Feb  1 18:15 registered_model_meta

/common/triton-model-repo/sid-minibert-onnx/1:
total 8
drwxr-xr-x 2 root root 4096 Feb  1 16:22 .
drwxr-xr-x 3 root root 4096 Feb  1 16:22 ..
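
For reference, a loadable model entry in a Triton repository needs a config.pbtxt (which names the backend) plus a populated version directory - this is what the source copy under /common/models/triton-model-repo should contain (a sketch, assuming the stock Morpheus model repo):

/common/models/triton-model-repo/sid-minibert-onnx/
├── config.pbtxt      (names the backend, e.g. platform: "onnxruntime_onnx")
└── 1/
    └── model.onnx    (the actual model file)

With no config.pbtxt and an empty version directory, Triton has no way to determine the backend, which matches the "Could not determine backend" error above.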

This is a bug in the mlflow-triton plugin's deployment code that should be addressed in the next release. Please note that it cannot be consistently reproduced in all environments.

In some environments, the plugin does not copy over the config.pbtxt that Triton requires for the model.

- Pete


Thanks for the pointer, Pete.

It happens consistently in my environment, even after an uninstall/reinstall and after rotating the k8s node. Please let me know if you'd like any specific information about my environment that would help you assess the root cause or impact.

For those looking for a workaround, you can manually copy the files to /common/triton-model-repo and then create the deployment again.

cp -r /common/models/triton-model-repo/sid-minibert-onnx/ /common/triton-model-repo/

For example -

(mlflow) root@mlflow-9dc885464-5hj58:/mlflow# mlflow deployments create -t triton \
>       --flavor triton \
>       --name sid-minibert-onnx \
>       -m models:/sid-minibert-onnx/1 \
>       -C "version=1"
Saved mlflow-meta.json to /common/triton-model-repo/sid-minibert-onnx
Traceback (most recent call last):
  .... SNIP ....
tritonclient.utils.InferenceServerException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository


(mlflow) root@mlflow-9dc885464-5hj58:/mlflow# cp -r /common/models/triton-model-repo/sid-minibert-onnx/ /common/triton-model-repo/

(mlflow) root@mlflow-9dc885464-5hj58:/mlflow# mlflow deployments create -t triton       --flavor triton       --name sid-minibert-onnx       -m models:/sid-minibert-onnx/1       -C "version=1"
Saved mlflow-meta.json to /common/triton-model-repo/sid-minibert-onnx
triton deployment sid-minibert-onnx is created
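
(To confirm Triton really loaded the model, you can query the deployment back through the plugin, or hit Triton's HTTP ready endpoint on the ai-engine service directly - the service name and port are the ones from the guide:)

mlflow deployments get -t triton --name sid-minibert-onnx
curl -s -o /dev/null -w '%{http_code}\n' http://ai-engine:8000/v2/models/sid-minibert-onnx/ready    # expect 200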
