SM deployment

Hello,

When trying to deploy various versions of the Mistral Docker container on SageMaker endpoints (using this NIM example as a guideline), I came across the following peculiarity:

  • When using public.ecr.aws/nvidia/nim:mistral-7b-instruct-v03-1.0.0, the SageMaker deployment on an ml.g5.12xlarge instance finished successfully.
  • When using nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest, the SageMaker deployment failed with the error: ModuleNotFoundError: No module named 'grpc'. You can run pip install "ray[serve]" to install all Ray Serve dependencies.

The stacktrace for the failure is:
File "/opt/nim/llm/.venv/bin/serve", line 5, in <module>
    from ray.serve.scripts import cli
File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/serve/__init__.py", line 29, in <module>
    raise e
File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/serve/__init__.py", line 4, in <module>
    from ray.serve.api import (
File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/serve/api.py", line 14, in <module>
    from ray.serve._private.config import (
File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/serve/_private/config.py", line 30, in <module>
    from ray.serve._private.utils import DEFAULT, DeploymentOptionUpdateType
File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/serve/_private/utils.py", line 28, in <module>
    from ray.serve._private.common import ServeComponentType
File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/serve/_private/common.py", line 23, in <module>
    from ray.serve.grpc_util import RayServegRPCContext
File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/serve/grpc_util.py", line 3, in <module>
    import grpc

Do you know why the latest container cannot be deployed on SageMaker?
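
For anyone hitting the same traceback, a quick way to confirm which dependencies are missing is to run a small check with the container's own interpreter (the traceback suggests it lives under /opt/nim/llm/.venv). This is only a minimal sketch: the helper name missing_modules is made up for illustration, and the list of modules to probe is an assumption based on the error message above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located.

    Hypothetical helper for illustration; it only checks whether a module
    can be found, it does not import it.
    """
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat lookup failures the same as "not found".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # 'grpc' is the module named in the ModuleNotFoundError; 'ray' is
    # included only to confirm that Ray itself is installed.
    print(missing_modules(["grpc", "ray"]))
```

If grpc shows up in the printed list, the environment is missing the gRPC package that ray.serve imports, which matches the failure above.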

Hi @Adrian_Alecu – NIM needs some modifications before it can be deployed on SageMaker, so you’ll need to use the images on ECR, for example:

public.ecr.aws/nvidia/nim:mistral-7b-instruct-v0.3-1.1.2

Alternatively, you can follow the instructions here to customize your launch environment for SageMaker.

Hello,

I tried public.ecr.aws/nvidia/nim:mistral-7b-instruct-v0.3-1.1.2, but it didn’t work either. This version of NIM requires NVIDIA driver version >= 535, while SageMaker instances in the ml.g5.* family come with NVIDIA driver 470.57.02. This incompatibility prevents the NIM container from deploying successfully.
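
To make the mismatch concrete, here is a rough sketch of the version comparison. The function names are hypothetical, and the minimum of 535 is simply the figure quoted above, not something read from the container. On a machine with the driver installed, the version string can be obtained with nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```python
def parse_driver_version(version):
    """Split a driver version string such as '470.57.02' into integers."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_meets_minimum(installed, minimum_major=535):
    """Check whether the installed driver's major version is high enough.

    minimum_major defaults to 535, the requirement reported in this thread
    (an assumption taken from the post, not from NIM documentation).
    """
    return parse_driver_version(installed)[0] >= minimum_major


if __name__ == "__main__":
    # Driver version reported for the ml.g5.* instances in this thread.
    print(driver_meets_minimum("470.57.02"))
```

With the 470.57.02 driver the check fails, which is consistent with the deployment error I am seeing.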