Override max_num_seqs on nvcr.io/nim/meta/llama-3.2-11b-vision-instruct

I have a single RTX 8000 with 48 GB; it's my only GPU.

I was able to run the model by reducing the context length to 16000 via the env var -e NIM_MAX_MODEL_LEN=16000, launching roughly as sketched below (the same as my full command later in this post, minus the relax-constraints variables).
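docker run -it --rm --gpus all -e NGC_API_KEY -e NIM_MAX_MODEL_LEN=16000 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8002:8000 nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest

But while the container is starting, it logs messages like this: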

ERROR 2025-01-24 19:27:36.16 llm_engine.py:507] CUDA out of memory error: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 47.45 GiB of which 413.31 MiB is free. Process 2212352 has 47.05 GiB memory in use. Of the allocated memory 46.68 GiB is allocated by PyTorch, and 178.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation). Reducing max_num_seqs by half: 128.

It keeps halving until max_num_seqs reaches 32, and only then does the container start.
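(As an aside, the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint from the log could be passed with another -e flag, but as far as I understand it only mitigates fragmentation, so I wouldn't expect it to recover the nearly 4 GiB the allocation needs.)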

I'm just testing, so I tried to override max_num_seqs with a value of 1. Reading the documentation, I found this:

NIM_RELAX_MEM_CONSTRAINTS

If set to 1 and NIM_NUM_KV_CACHE_SEQ_LENS not specified then NIM_NUM_KV_CACHE_SEQ_LENS will automatically be set to 1. Otherwise if set to 1 will use value provided from NIM_NUM_KV_CACHE_SEQ_LENS. The recommended default for NIM LLM is for all GPUs to have >= 95% of memory free. Setting this variable to true overrides this default and will run the model regardless of memory constraints. It will also use heuristics to determine if GPU will likely meet or fail memory requirements and will provide a warning if applicable.

NIM_NUM_KV_CACHE_SEQ_LENS

NIM_RELAX_MEM_CONSTRAINTS must be set to 1 for this environment variable to take effect. Set to a value greater than or equal to 1 to override the default KV cache memory allocation settings for NIM LLM. The value provided will be used to determine how many maximum sequence lengths can fit within the KV cache (for example, 2 or 3.75). The maximum sequence length is the context size of the model.
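(If I read that correctly, with NIM_MAX_MODEL_LEN=16000 a value of 1 would size the KV cache for roughly one 16000-token sequence, a value of 2 for roughly 32000 tokens, and so on.)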

Then I tried:

docker run -it --rm --gpus all -e NGC_API_KEY -e NIM_MAX_MODEL_LEN=16000 -e NIM_RELAX_MEM_CONSTRAINTS=1 -e NIM_NUM_KV_CACHE_SEQ_LENS=1 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8002:8000 nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest

but I still got the same behaviour.

Then I tried running python3 -m vllm_nvext.entrypoints.openai.api_server --max-num-seqs 1 directly, which apparently worked, but oddly the model then used more GPU memory. Comparing those logs with a normal startup, I suspect some initialization steps the container normally performs were skipped when I overrode the entrypoint.
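For reference, that attempt was roughly the following (a sketch; depending on the image's ENTRYPOINT, the command may need to go through --entrypoint instead of being appended after the image name):

docker run -it --rm --gpus all -e NGC_API_KEY -e NIM_MAX_MODEL_LEN=16000 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8002:8000 nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest python3 -m vllm_nvext.entrypoints.openai.api_server --max-num-seqs 1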

Is there a way to override the max_num_seqs?

Hey there,

Can you try using NIM_MAX_NUM_SEQS instead of NIM_NUM_KV_CACHE_SEQ_LENS according to these docs and see if that works?
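Something like this, adapting your earlier command (an untested sketch on my side):

docker run -it --rm --gpus all -e NGC_API_KEY -e NIM_MAX_MODEL_LEN=16000 -e NIM_MAX_NUM_SEQS=1 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8002:8000 nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest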

It worked, thanks.

I didn’t realize that there is different documentation for vision models.

Thanks for letting us know it got resolved! Great news, have a great day!
