Blueprint RAG v2.0.0

Hello,
I am trying to launch NVIDIA-AI-Blueprints/rag (tag v2.0.0).
The host is VMware ESXi v8.0 + ESXi_8.0.0_Driver, the guest is Ubuntu 24.04 with Driver Version: 570.124.06 and CUDA Version: 12.8.
nvidia-smi output:

# nvidia-smi
Mon Apr 21 10:23:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-2-20C               On  |   00000000:02:00.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  GRID A100D-2-20C               On  |   00000000:02:02.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  GRID A100D-2-20C               On  |   00000000:02:03.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  GRID A100D-2-20C               On  |   00000000:02:04.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   4  GRID A100D-2-20C               On  |   00000000:02:05.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   5  GRID A100D-2-20C               On  |   00000000:02:06.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.17.5
commit: f785e908a7f72149f8912617058644fd84e38cde
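For reference, the MIG slices and their UUIDs (relevant further down if individual slices need to be pinned) can be listed with:

# nvidia-smi -L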

Relevant part of deploy/compose/nims.yaml:

services:
  nim-llm:
    container_name: nim-llm-ms
    image: nvcr.io/nim/meta/llama-3.1-70b-instruct-pb24h2:1.3.4
    volumes:
    - ${MODEL_DIRECTORY:-./}:/opt/nim/.cache
    user: "${USERID}"
    ports:
    - "8999:8000"
    expose:
    - "8000"
    security_opt:
      - label=disable
    environment:
      NGC_API_KEY: ${NVIDIA_API_KEY}
      CUDA_VERSION: "12.8.0"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
      NVIDIA_VISIBLE_DEVICES: "all"
      NV_CUDA_CUDART_VERSION: "12.8.57-1"
    runtime: nvidia
    shm_size: 20gb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: ${INFERENCE_GPU_COUNT:-all}
              #device_ids: ['${LLM_MS_GPU_ID:-2,3}']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "python3", "-c", "import requests; requests.get('http://localhost:8000/v1/health/ready')"]
      interval: 10s
      timeout: 20s
      retries: 100
    profiles: ["", "rag"]
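The device_ids line is commented out above, so the service currently requests all devices. If pinning specific MIG slices turns out to be necessary, my understanding is that the visible-devices value has to be MIG UUIDs (as printed by nvidia-smi -L) rather than plain GPU indices. A rough one-off test of pinning a single slice, with a placeholder UUID, would be something like:

# docker run --rm --runtime=nvidia \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    -e NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
    ubuntu:24.04 nvidia-smi -L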

Environment variables:

MODEL_DIRECTORY=/data/nvidia/.cache/model-cache
NVIDIA_API_KEY=nvapi-lxTkb......OILb

NVIDIA Container Toolkit configuration:

# cat /etc/nvidia-container-runtime/config.toml 
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = true
supported-driver-capabilities = "all"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:root"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "debug"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = true

[nvidia-ctk]
path = "nvidia-ctk"

Launch command and output:

# USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d nim-llm
# docker logs -f nim-llm-ms 

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.3.3
Model: meta/llama-3.1-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and the Product Specific Terms for AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products).

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement).

ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.

You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 99, in <module>
    main()
  File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 42, in main
    inference_env = prepare_environment()
  File "/opt/nim/llm/nim_llm_sdk/entrypoints/args.py", line 204, in prepare_environment
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/opt/nim/llm/nim_llm_sdk/hub/ngc_injector.py", line 239, in inject_ngc_hub
    system = get_hardware_spec()
  File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 358, in get_hardware_spec
    device_mem_total, device_mem_free, device_mem_used, device_mem_reserved = gpus.device_mem(device_id)
  File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 198, in device_mem
    mem_data = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/pynvml/nvml.py", line 2440, in nvmlDeviceGetMemoryInfo
    _nvmlCheckReturn(ret)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NoPermission: Insufficient Permissions

With the image nvcr.io/nim/meta/llama-3.1-70b-instruct:1.8 the container starts, but the OOM killer terminates it after some time.

Hi @ok111, I don't believe you have enough GPU memory to run the 70b model. It requires roughly 2 x 70B bytes to run at fp16 (since A100 doesn't support fp8), which is roughly 140GB. Your nvidia-smi output shows six 20GiB MIG slices, which is roughly 120GB; this would explain the OOM issue with the 1.8 container.
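Back-of-envelope, counting weights only and ignoring KV cache and activations:

# python3 -c "print(70e9 * 2 / 1e9)"   # ~140 GB of fp16 weights
# python3 -c "print(6 * 20)"           # ~120 GiB across the six 20 GiB MIG slices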

The pynvml issue likely has to do with the use of MIG. NIM is not supported on MIG GPUs, but we might have to do some more digging to determine the exact issue. Can you try running a smaller model (like one of the 8b models) and see what happens? What about different versions of the 8b model?
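If it helps with the digging, something like the following should isolate the failing NVML call inside the same image. This is only a rough sketch: the venv interpreter path is guessed from your traceback, and the flags may need adjusting for your setup.

# docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
    --entrypoint /opt/nim/llm/.venv/bin/python3 \
    nvcr.io/nim/meta/llama-3.1-70b-instruct-pb24h2:1.3.4 \
    -c 'import pynvml; pynvml.nvmlInit(); h = pynvml.nvmlDeviceGetHandleByIndex(0); print(pynvml.nvmlDeviceGetMemoryInfo(h, version=pynvml.nvmlMemory_v2))'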