When launching a Mistral Instruct v0.3 NIM container on an ml.g5.4xlarge AWS SageMaker instance, we get the following error:
ValueError: The model’s max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (16912). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
How can we specify a value for max_model_len when launching the SageMaker endpoint?
We tried setting "MAX_MODEL_LEN" and "OPTION_MAX_MODEL_LEN" as environment variables when creating the model, but this approach didn't work.
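For reference, this is roughly what we tried; the image URI, role ARN, and model name below are placeholders, not the exact values from our setup:

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholders: substitute the NIM image you pushed to ECR and your execution role.
nim_image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/<mistral-nim-image>:<tag>"
execution_role = "arn:aws:iam::<account>:role/<sagemaker-execution-role>"

sm.create_model(
    ModelName="mistral-7b-instruct-nim",
    ExecutionRoleArn=execution_role,
    PrimaryContainer={
        "Image": nim_image_uri,
        "Environment": {
            # Neither of these is picked up by the NIM container:
            "MAX_MODEL_LEN": "16000",
            "OPTION_MAX_MODEL_LEN": "16000",
        },
    },
)
```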
What is the configuration of the ml.g5.4xlarge instance, in particular its GPU memory capacity?
As NIM currently doesn't support the max sequence length as a dynamic parameter (via an environment variable; see Configuring a NIM - NVIDIA Docs), I think we need to use an instance with sufficient GPU memory (an A100 80 GB should be supported).
Otherwise, the way to set the max_model_len parameter is to override the container entry point and pass it in as a command-line flag, along the lines of this command:
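Something along these lines (this is a hypothetical sketch only: the image tag, the path of the server script inside the container, and the flag name are assumptions, so inspect the image's default entry point, e.g. with docker inspect, and the NIM docs for the real invocation):

```
# Hypothetical sketch: override the default entry point and forward an
# explicit max model length flag to the server process.
docker run --rm --gpus all \
  --entrypoint /opt/nim/start_server.sh \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest \
  --max-model-len 16000
```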
Looks like the g5.4xlarge has a single 24 GB GPU. This model at this sequence length would require at least one 80 GB GPU (such as an A100 or H100) or two 48 GB GPUs (such as the L40S).
The reason I am asking this question is that by setting max_model_len to 16000 (for example), I managed to deploy the model on an ml.g5.4xlarge machine using AWS LMI containers. While switching to a more powerful machine would solve the problem, hosting the 7B-parameter model on such an instance would be a waste of compute resources.
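For comparison, this is roughly how the LMI deployment looked. The container image URI and role are placeholders, and the OPTION_* environment variable names (which map to serving.properties options) should be checked against the current LMI documentation for the vLLM backend:

```python
import sagemaker
from sagemaker.model import Model

role = "arn:aws:iam::<account>:role/<sagemaker-execution-role>"  # placeholder
lmi_image_uri = "<lmi-container-image-uri-for-your-region>"      # placeholder, see AWS LMI docs

# OPTION_* environment variables are translated into serving.properties options by LMI.
model = Model(
    image_uri=lmi_image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3",
        "OPTION_ROLLING_BATCH": "vllm",           # use the vLLM backend
        "OPTION_MAX_MODEL_LEN": "16000",          # cap the context so the KV cache fits in 24 GB
        "OPTION_GPU_MEMORY_UTILIZATION": "0.9",
    },
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    endpoint_name="mistral-7b-instruct-lmi",
)
```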
The problem that I raised pertains to a subset of problems I am facing while moving away from AWS LMI to NVIDIA NIM. Specifically, the nature of the applications I am working with requires utilizing prefix caching from vLLM (as well as accessing other vLLM properties) or setting a specific batch size and potentially a different model distribution strategy (model sharding or model replicas on each GPU). Looking through the API, I could not find a reference to accessing these properties.
Same problem here. Would love to see whether or not it is possible to alter MAX_MODEL_LEN or max_position_embeddings somehow.
I tried altering the profile's settings in the model_manifest.yaml file by adding max_model_len: '14000' and later gpu_memory_utilization: '0.5', but even though the value showed up in the logs, it was ultimately not honored and I ended up with the same error as the OP.