When launching a Mistral Instruct v0.3 NIM container on an ml.g5.4xlarge AWS SageMaker instance, we get the following error:
ValueError: The model’s max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (16912). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
How can we specify a value for max_model_len when launching the SageMaker endpoint?
We tried setting "MAX_MODEL_LEN" and "OPTION_MAX_MODEL_LEN" as environment variables when creating the model, but this approach didn't work.
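For reference, this is roughly what we tried; the image URI, role ARN, and model name below are placeholders, not the exact values from our setup:

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholders: substitute the NIM image you pushed to ECR and your execution role.
nim_image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/<mistral-nim-image>:<tag>"
execution_role = "arn:aws:iam::<account>:role/<sagemaker-execution-role>"

sm.create_model(
    ModelName="mistral-7b-instruct-nim",
    ExecutionRoleArn=execution_role,
    PrimaryContainer={
        "Image": nim_image_uri,
        "Environment": {
            # Neither of these is picked up by the NIM container:
            "MAX_MODEL_LEN": "16000",
            "OPTION_MAX_MODEL_LEN": "16000",
        },
    },
)
```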
What is the configuration of the ml.g5.4xlarge instance, in particular its GPU memory capacity?
As NIM currently doesn't support the max sequence length as a dynamic parameter (via an environment variable; see Configuring a NIM - NVIDIA Docs), I think we need to use an instance with sufficient GPU memory (an A100 80 GB should be supported).
Otherwise, the way to set the max_model_len parameter is to override the container entry point and pass it in as a command-line flag, along the lines of this command:
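Something along these lines (this is a hypothetical sketch only: the image tag, the path of the server script inside the container, and the flag name are assumptions, so inspect the image's default entry point, e.g. with docker inspect, and the NIM docs for the real invocation):

```
# Hypothetical sketch: override the default entry point and forward an
# explicit max model length flag to the server process.
docker run --rm --gpus all \
  --entrypoint /opt/nim/start_server.sh \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest \
  --max-model-len 16000
```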
Looks like the g5.4xlarge has a single 24 GB GPU. This model at this sequence length would require at least one 80 GB GPU (such as an A100 or H100) or two 48 GB GPUs (such as the L40S).
The reason I am asking this question is that by setting max_model_len to 16000 (for example), I managed to deploy the model on an ml.g5.4xlarge machine using AWS LMI containers. While switching to a more powerful machine would solve the problem, hosting the 7B-parameter model on such an instance would be a waste of compute resources.
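For comparison, this is roughly how the LMI deployment looked. The container image URI and role are placeholders, and the OPTION_* environment variable names (which map to serving.properties options) should be checked against the current LMI documentation for the vLLM backend:

```python
import sagemaker
from sagemaker.model import Model

role = "arn:aws:iam::<account>:role/<sagemaker-execution-role>"  # placeholder
lmi_image_uri = "<lmi-container-image-uri-for-your-region>"      # placeholder, see AWS LMI docs

# OPTION_* environment variables are translated into serving.properties options by LMI.
model = Model(
    image_uri=lmi_image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3",
        "OPTION_ROLLING_BATCH": "vllm",           # use the vLLM backend
        "OPTION_MAX_MODEL_LEN": "16000",          # cap the context so the KV cache fits in 24 GB
        "OPTION_GPU_MEMORY_UTILIZATION": "0.9",
    },
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    endpoint_name="mistral-7b-instruct-lmi",
)
```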
The problem that I raised pertains to a subset of problems I am facing while moving away from AWS LMI to NVIDIA NIM. Specifically, the nature of the applications I am working with requires utilizing prefix caching from vLLM (as well as accessing other vLLM properties) or setting a specific batch size and potentially a different model distribution strategy (model sharding or model replicas on each GPU). Looking through the API, I could not find a reference to accessing these properties.
Same problem here. Would love to see whether or not it is possible to alter MAX_MODEL_LEN or max_position_embeddings somehow.
I tried altering the profile's settings in the model_manifest.yaml file by adding max_model_len: '14000' and later gpu_memory_utilization: '0.5', but even though the value showed up in the logs, it was ultimately not honored and I ended up with the same error as the OP.