Hello,
I am deploying Morpheus in the cloud (AWS EKS) using the Cloud Deployment Guide, but I get the following error when trying to create an MLflow deployment using the provided models. I am following the guide exactly as written.
Any help or guidance on how to proceed here would be much appreciated!
Steps followed - Morpheus Cloud Deployment Guide - NVIDIA Docs
Environment -
- Morpheus 23.07
- Kubernetes 1.27
- EC2 Instance - g4dn.2xlarge
# kubectl -n $NAMESPACE get all
NAME                             READY   STATUS    RESTARTS   AGE
pod/ai-engine-b5dd6fc65-lgbmp    1/1     Running   0          22h
pod/broker-8555cf8678-s4pdq      1/1     Running   0          22h
pod/mlflow-9dc885464-4gggd       1/1     Running   0          22h
pod/sdk-cli-helper               1/1     Running   0          139m
pod/zookeeper-6458958bc9-lb2jq   1/1     Running   0          22h

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/ai-engine         ClusterIP   172.20.194.146   <none>        8000/TCP,8001/TCP,8002/TCP   22h
service/broker            ClusterIP   172.20.219.166   <none>        9092/TCP                     22h
service/broker-external   NodePort    172.20.118.240   <none>        9092:30092/TCP               22h
service/mlflow            NodePort    172.20.162.197   <none>        5000:30500/TCP               22h
service/notebook          NodePort    172.20.36.124    <none>        8888:30888/TCP               139m
service/zookeeper         ClusterIP   172.20.114.237   <none>        2181/TCP                     22h

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ai-engine   1/1     1            1           22h
deployment.apps/broker      1/1     1            1           22h
deployment.apps/mlflow      1/1     1            1           22h
deployment.apps/zookeeper   1/1     1            1           22h

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/ai-engine-b5dd6fc65    1         1         1       22h
replicaset.apps/broker-8555cf8678      1         1         1       22h
replicaset.apps/mlflow-9dc885464       1         1         1       22h
replicaset.apps/zookeeper-6458958bc9   1         1         1       22h
First, I try to publish the model as described in the guide, and this exits without any error -
(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# python publish_model_to_mlflow.py \
> --model_name sid-minibert-onnx \
> --model_directory /common/models/triton-model-repo/sid-minibert-onnx \
> --flavor triton
Registered model 'sid-minibert-onnx' already exists. Creating a new version of this model...
Created version '2' of model 'sid-minibert-onnx'.
/mlflow/artifacts/0/8687820bbc674584b5e125093ea80c22/artifacts
This appears to work, and I can see the model in the MLflow UI, but then I get this error when creating the deployment -
(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# mlflow deployments create -t triton --flavor triton --name sid-minibert-onnx -m models:/sid-minibert-onnx/1 -C "version=1"
Saved mlflow-meta.json to /common/triton-model-repo/sid-minibert-onnx
Traceback (most recent call last):
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 110, in create_deployment
self.triton_client.load_model(name)
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_client.py", line 663, in load_model
_raise_if_error(response)
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_utils.py", line 69, in _raise_if_error
raise error
tritonclient.utils.InferenceServerException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/envs/mlflow/bin/mlflow", line 8, in <module>
sys.exit(cli())
^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow/deployments/cli.py", line 146, in create_deployment
deployment = client.create_deployment(name, model_uri, flavor, config=config_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 112, in create_deployment
raise MlflowException(str(ex))
mlflow.exceptions.MlflowException: [400] failed to load 'sid-minibert-onnx', failed to poll from model repository
This is the error on the ai-engine pod -
E0201 18:15:56.857982 1 model_repository_manager.cc:1306] Poll failed for model directory 'sid-minibert-onnx': Invalid model name: Could not determine backend for model 'sid-minibert-onnx' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
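As far as I understand it, Triton reports this when a model directory has no config.pbtxt, or when the config declares neither a backend nor a platform. Below is a minimal Python sketch for checking that on a given config text. The helper name declared_backend is hypothetical, and the sample config values are only illustrative, not the actual contents of the Morpheus model config:

```python
import re

# Matches lines of the form `backend: "..."` or `platform: "..."` in config.pbtxt text.
_FIELD = re.compile(r'^\s*(backend|platform)\s*:\s*"([^"]*)"', re.MULTILINE)


def declared_backend(config_text: str) -> dict:
    """Return the backend/platform fields found in the given config.pbtxt text."""
    return {name: value for name, value in _FIELD.findall(config_text)}


# Illustrative sample only (assumed values, not taken from the real model).
sample = 'name: "sid-minibert-onnx"\nplatform: "onnxruntime_onnx"\nmax_batch_size: 32\n'
print(declared_backend(sample))
print(declared_backend(""))
```

An empty result would be consistent with the "no backend in model configuration" part of the message.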
While digging into this, I noticed that the model directory “1” within TRITON_MODEL_REPO appears to be empty -
(mlflow) root@mlflow-9dc885464-4gggd:/mlflow# ls -laR /common/triton-model-repo/sid-minibert-onnx/
/common/triton-model-repo/sid-minibert-onnx/:
total 20
drwxr-xr-x 3 root root 4096 Feb 1 16:22 .
drwxr-xr-x 4 root root 4096 Feb 1 17:47 ..
drwxr-xr-x 2 root root 4096 Feb 1 16:22 1
-rw-r--r-- 1 root root 186 Feb 1 18:15 mlflow-meta.json
-rw-r--r-- 1 root root 49 Feb 1 18:15 registered_model_meta
/common/triton-model-repo/sid-minibert-onnx/1:
total 8
drwxr-xr-x 2 root root 4096 Feb 1 16:22 .
drwxr-xr-x 3 root root 4096 Feb 1 16:22 ..
root@ai-engine-b5dd6fc65-lgbmp:/opt/tritonserver# ls -laR /common/triton-model-repo/sid-minibert-onnx/
/common/triton-model-repo/sid-minibert-onnx/:
total 20
drwxr-xr-x 3 root root 4096 Feb 1 16:22 .
drwxr-xr-x 4 root root 4096 Feb 1 17:47 ..
drwxr-xr-x 2 root root 4096 Feb 1 16:22 1
-rw-r--r-- 1 root root 186 Feb 1 18:15 mlflow-meta.json
-rw-r--r-- 1 root root 49 Feb 1 18:15 registered_model_meta
/common/triton-model-repo/sid-minibert-onnx/1:
total 8
drwxr-xr-x 2 root root 4096 Feb 1 16:22 .
drwxr-xr-x 3 root root 4096 Feb 1 16:22 ..
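To make the same check repeatable from either pod, a small Python sketch along these lines could be used. It assumes the usual Triton layout of a config.pbtxt next to numeric version directories that hold the model file. The function name check_model_dir is hypothetical, and the path is the one from the listings above:

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> dict:
    """Summarize a Triton-style model directory: config presence and files per version."""
    root = Path(model_dir)
    report = {"has_config": (root / "config.pbtxt").is_file(), "versions": {}}
    if not root.is_dir():
        # Nothing to walk if the directory does not exist on this pod.
        return report
    for entry in sorted(root.iterdir()):
        # Version directories are named with integers, e.g. "1".
        if entry.is_dir() and entry.name.isdigit():
            report["versions"][entry.name] = sorted(
                p.name for p in entry.iterdir() if p.is_file()
            )
    return report


# Path taken from the listings above.
print(check_model_dir("/common/triton-model-repo/sid-minibert-onnx"))
```

Going by the listings above, this should report no config.pbtxt and an empty file list for version 1 on both pods.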