Hello All,
I am following the tutorial at https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#launch-nvidia-nim-for-llms to set up a NIM environment.
My system is an Ubuntu 22.04 machine with an NVIDIA A16 GPU, and here is a screenshot of my GPU usage:
When running the Docker container for the llama3-8b-instruct model, I ran into a CUDA out-of-memory error.
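For reference, I launched the container roughly as the getting-started page describes (the API key and cache path are the usual placeholders from the guide, set beforehand in my environment):

```shell
# Launch command per the getting-started guide; NGC_API_KEY and
# LOCAL_NIM_CACHE are placeholder environment variables set beforehand.
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

The container output leading up to the failure follows.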
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2024-07-25 17:47:52,811 [INFO] PyTorch version 2.2.2 available.
2024-07-25 17:47:53,349 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-07-25 17:47:53,349 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-07-25 17:47:53,452 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 07-25 17:47:54.348 api_server.py:489] NIM LLM API version 1.0.0
INFO 07-25 17:47:54.350 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 07-25 17:47:54.350 ngc_profile.py:219] Detected 1 compatible profile(s).
INFO 07-25 17:47:54.350 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 07-25 17:47:54.350 ngc_injector.py:141] Selected profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: tp: 1
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 07-25 17:47:54.644 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 07-25 17:47:56.911 ngc_injector.py:172] Model workspace is now ready. It took 2.267 seconds
INFO 07-25 17:47:56.915 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-reo4cam9', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-reo4cam9', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 07-25 17:47:57.231 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-25 17:47:57.250 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 07-25 17:47:59 selector.py:28] Using FlashAttention backend.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 22, in _init_executor
    self._init_non_spec_worker()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 51, in _init_non_spec_worker
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 114, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 162, in load_model
    self.model = get_model(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 227, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 93, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 338, in __init__
    self.lm_head = ParallelLMHead(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 132, in __init__
    super().__init__(num_embeddings, embedding_dim, params_dtype,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 74, in __init__
    torch.empty(self.num_embeddings_per_partition,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 14.53 GiB of which 150.00 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 13.98 GiB is allocated by PyTorch, and 19.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
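Doing some rough math on the numbers in that error, it looks like the fp16 weights alone may not fit in the capacity the message reports for GPU 0. A quick sanity check (assuming roughly 8.03B parameters for Llama 3 8B, 2 bytes each at fp16 — the parameter count is my assumption, not something from the log):

```python
# Rough sanity check: do the fp16 weights of Llama 3 8B fit in the
# ~14.53 GiB the error message reports for GPU 0 (one 16 GB A16 slice)?
GIB = 1024 ** 3

n_params = 8.03e9               # approximate parameter count of Llama 3 8B (assumption)
bytes_per_param = 2             # fp16 / bf16
weights_gib = n_params * bytes_per_param / GIB

reported_capacity_gib = 14.53   # "GPU 0 has a total capacity of 14.53 GiB"
already_allocated_gib = 13.98   # "13.98 GiB is allocated by PyTorch"
failed_alloc_gib = 1002 / 1024  # the 1002 MiB allocation for the lm_head that failed

print(f"weights alone: {weights_gib:.2f} GiB")                                   # ~14.96 GiB
print(f"allocated + failed alloc: {already_allocated_gib + failed_alloc_gib:.2f} GiB")
print(f"fits in {reported_capacity_gib} GiB? {weights_gib < reported_capacity_gib}")
```

The ~14.96 GiB of weights matches the 13.98 GiB already allocated plus the 1002 MiB allocation that failed, and exceeds the 14.53 GiB the driver reports as usable, before any KV cache is even allocated. If that arithmetic is right, I don't understand why the vllm-fp16-tp1 profile is still reported as compatible.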
I also checked the compatible image profiles, and there is one available (vllm-fp16-tp1), as shown here:
So I am not sure why I ran into this issue.
If anyone knows what is happening, please let me know.
Thanks

