Hello All,
I am following the tutorial at https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#launch-nvidia-nim-for-llms to set up a NIM environment.
My system is an Ubuntu 22.04 machine with an NVIDIA A16 GPU, and here is a screenshot of my GPU usage:
When running the Docker container for the llama3-8b-instruct model, I ran into a CUDA out-of-memory error.
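For reference, I launched the container roughly as the getting-started page describes (the API key and cache path are the usual placeholders from the guide, set beforehand in my environment):

```shell
# Launch command per the getting-started guide; NGC_API_KEY and
# LOCAL_NIM_CACHE are placeholder environment variables set beforehand.
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

The container output leading up to the failure follows.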
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2024-07-25 17:47:52,811 [INFO] PyTorch version 2.2.2 available.
2024-07-25 17:47:53,349 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-07-25 17:47:53,349 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-07-25 17:47:53,452 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 07-25 17:47:54.348 api_server.py:489] NIM LLM API version 1.0.0
INFO 07-25 17:47:54.350 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 07-25 17:47:54.350 ngc_profile.py:219] Detected 1 compatible profile(s).
INFO 07-25 17:47:54.350 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 07-25 17:47:54.350 ngc_injector.py:141] Selected profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: tp: 1
INFO 07-25 17:47:54.644 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 07-25 17:47:54.644 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 07-25 17:47:56.911 ngc_injector.py:172] Model workspace is now ready. It took 2.267 seconds
INFO 07-25 17:47:56.915 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-reo4cam9', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-reo4cam9', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 07-25 17:47:57.231 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-25 17:47:57.250 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 07-25 17:47:59 selector.py:28] Using FlashAttention backend.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 22, in _init_executor
    self._init_non_spec_worker()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 51, in _init_non_spec_worker
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 114, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 162, in load_model
    self.model = get_model(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 227, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 93, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 338, in __init__
    self.lm_head = ParallelLMHead(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 132, in __init__
    super().__init__(num_embeddings, embedding_dim, params_dtype,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 74, in __init__
    torch.empty(self.num_embeddings_per_partition,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 14.53 GiB of which 150.00 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 13.98 GiB is allocated by PyTorch, and 19.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
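Doing some rough math on the numbers in that error, it looks like the fp16 weights alone may not fit in the capacity the message reports for GPU 0. A quick sanity check (assuming roughly 8.03B parameters for Llama 3 8B, 2 bytes each at fp16 — the parameter count is my assumption, not something from the log):

```python
# Rough sanity check: do the fp16 weights of Llama 3 8B fit in the
# ~14.53 GiB the error message reports for GPU 0 (one 16 GB A16 slice)?
GIB = 1024 ** 3

n_params = 8.03e9               # approximate parameter count of Llama 3 8B (assumption)
bytes_per_param = 2             # fp16 / bf16
weights_gib = n_params * bytes_per_param / GIB

reported_capacity_gib = 14.53   # "GPU 0 has a total capacity of 14.53 GiB"
already_allocated_gib = 13.98   # "13.98 GiB is allocated by PyTorch"
failed_alloc_gib = 1002 / 1024  # the 1002 MiB allocation for the lm_head that failed

print(f"weights alone: {weights_gib:.2f} GiB")                                   # ~14.96 GiB
print(f"allocated + failed alloc: {already_allocated_gib + failed_alloc_gib:.2f} GiB")
print(f"fits in {reported_capacity_gib} GiB? {weights_gib < reported_capacity_gib}")
```

The ~14.96 GiB of weights matches the 13.98 GiB already allocated plus the 1002 MiB allocation that failed, and exceeds the 14.53 GiB the driver reports as usable, before any KV cache is even allocated. If that arithmetic is right, I don't understand why the vllm-fp16-tp1 profile is still reported as compatible.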
I also checked the compatible image profiles, and there is one available (vllm-fp16-tp1), as shown here:
So I am not sure why I ran into this issue.
If anyone knows what is happening, please let me know.
Thanks

