I have a single GPU, an RTX 8000 with 48 GB.
I was able to run the model by reducing the context to 16000 with the env var NIM_MAX_MODEL_LEN=16000. The command was roughly the following (the same as the one further down, just without the relax-constraints variables):
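docker run -it --rm --gpus all -e NGC_API_KEY -e NIM_MAX_MODEL_LEN=16000 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8002:8000 nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest

While the container is starting, though, it prints this kind of output: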
ERROR 2025-01-24 19:27:36.16 llm_engine.py:507] CUDA out of memory error: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 47.45 GiB of which 413.31 MiB is free. Process 2212352 has 47.05 GiB memory in use. Of the allocated memory 46.68 GiB is allocated by PyTorch, and 178.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation). Reducing max_num_seqs by half: 128.
It keeps halving max_num_seqs until it reaches 32, and then the container starts.
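(Side note: the allocator hint from that log, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, could presumably be passed like any other variable, e.g. adding -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to the docker run line, but that targets allocator fragmentation rather than the number of sequences I want to limit.)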
I'm just testing, so I tried to override max_num_seqs with a value of 1. Reading the documentation, I found this:
NIM_RELAX_MEM_CONSTRAINTS
If set to 1 and NIM_NUM_KV_CACHE_SEQ_LENS not specified then NIM_NUM_KV_CACHE_SEQ_LENS will automatically be set to 1. Otherwise if set to 1 will use value provided from NIM_NUM_KV_CACHE_SEQ_LENS. The recommended default for NIM LLM is for all GPUs to have >= 95% of memory free. Setting this variable to true overrides this default and will run the model regardless of memory constraints. It will also use heuristics to determine if GPU will likely meet or fail memory requirements and will provide a warning if applicable.
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_RELAX_MEM_CONSTRAINTS must be set to 1 for this environment variable to take effect. Set to a value greater than or equal to 1 to override the default KV cache memory allocation settings for NIM LLM. The value provided will be used to determine how many maximum sequence lengths can fit within the KV cache (for example 2 or 3.75). The maximum sequence length is the context size of the model.
Then I tried:
docker run -it --rm --gpus all -e NGC_API_KEY -e NIM_MAX_MODEL_LEN=16000 -e NIM_RELAX_MEM_CONSTRAINTS=1 -e NIM_NUM_KV_CACHE_SEQ_LENS=1 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8002:8000 nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest
but I still got the same behaviour.
Then I tried launching the server directly with python3 -m vllm_nvext.entrypoints.openai.api_server --max-num-seqs 1, which apparently worked, but oddly the model took more GPU memory to run. Comparing these logs with the logs from a run that keeps the default entrypoint, I suspect some startup steps are skipped when the entrypoint is overridden.
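For reference, the override looked roughly like this (the --entrypoint form is my reconstruction, and I'm not sure the NIM-specific env vars are even read when the default entrypoint is bypassed):

docker run -it --rm --gpus all -e NGC_API_KEY -e NIM_MAX_MODEL_LEN=16000 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8002:8000 --entrypoint python3 nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest -m vllm_nvext.entrypoints.openai.api_server --max-num-seqs 1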
Is there a proper way to override max_num_seqs?