Blueprint RAG v2.0.0

Hello,
I am trying to launch NVIDIA-AI-Blueprints/rag (tag v2.0.0).
The host is VMware ESXi v8.0 + ESXi_8.0.0_Driver, the guest is Ubuntu 24.04 with Driver Version: 570.124.06 and CUDA Version: 12.8.
nvidia-smi output:

# nvidia-smi
Mon Apr 21 10:23:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-2-20C               On  |   00000000:02:00.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  GRID A100D-2-20C               On  |   00000000:02:02.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  GRID A100D-2-20C               On  |   00000000:02:03.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   3  GRID A100D-2-20C               On  |   00000000:02:04.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   4  GRID A100D-2-20C               On  |   00000000:02:05.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   5  GRID A100D-2-20C               On  |   00000000:02:06.0 Off |                   On |
| N/A   N/A    P0            N/A  /  N/A  |       1MiB /  20480MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  2    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  3    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  4    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  5    0   0   0  |               1MiB / 18412MiB    | 28      0 |  2   0    1    0    0 |
|                  |                 0MiB /  4096MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.17.5
commit: f785e908a7f72149f8912617058644fd84e38cde
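For reference, the MIG slices and their UUIDs (relevant further down if individual slices need to be pinned) can be listed with:

# nvidia-smi -L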

Relevant part of deploy/compose/nims.yaml:

services:
  nim-llm:
    container_name: nim-llm-ms
    image: nvcr.io/nim/meta/llama-3.1-70b-instruct-pb24h2:1.3.4
    volumes:
    - ${MODEL_DIRECTORY:-./}:/opt/nim/.cache
    user: "${USERID}"
    ports:
    - "8999:8000"
    expose:
    - "8000"
    security_opt:
      - label=disable
    environment:
      NGC_API_KEY: ${NVIDIA_API_KEY}
      CUDA_VERSION: "12.8.0"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
      NVIDIA_VISIBLE_DEVICES: "all"
      NV_CUDA_CUDART_VERSION: "12.8.57-1"
    runtime: nvidia
    shm_size: 20gb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: ${INFERENCE_GPU_COUNT:-all}
              #device_ids: ['${LLM_MS_GPU_ID:-2,3}']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "python3", "-c", "import requests; requests.get('http://localhost:8000/v1/health/ready')"]
      interval: 10s
      timeout: 20s
      retries: 100
    profiles: ["", "rag"]
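The device_ids line is commented out above, so the service currently requests all devices. If pinning specific MIG slices turns out to be necessary, my understanding is that the visible-devices value has to be MIG UUIDs (as printed by nvidia-smi -L) rather than plain GPU indices. A rough one-off test of pinning a single slice, with a placeholder UUID, would be something like:

# docker run --rm --runtime=nvidia \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    -e NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
    ubuntu:24.04 nvidia-smi -L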

Environment variables:

MODEL_DIRECTORY=/data/nvidia/.cache/model-cache
NVIDIA_API_KEY=nvapi-lxTkb......OILb

NVIDIA Container Toolkit configuration:

# cat /etc/nvidia-container-runtime/config.toml 
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = true
supported-driver-capabilities = "all"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:root"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "debug"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = true

[nvidia-ctk]
path = "nvidia-ctk"

Launch command and output:

# USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d nim-llm
# docker logs -f nim-llm-ms 

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.3.3
Model: meta/llama-3.1-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The NIM container is governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and the Product Specific Terms for AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products).

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement).

ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.

You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 99, in <module>
    main()
  File "/opt/nim/llm/nim_llm_sdk/entrypoints/launch.py", line 42, in main
    inference_env = prepare_environment()
  File "/opt/nim/llm/nim_llm_sdk/entrypoints/args.py", line 204, in prepare_environment
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/opt/nim/llm/nim_llm_sdk/hub/ngc_injector.py", line 239, in inject_ngc_hub
    system = get_hardware_spec()
  File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 358, in get_hardware_spec
    device_mem_total, device_mem_free, device_mem_used, device_mem_reserved = gpus.device_mem(device_id)
  File "/opt/nim/llm/nim_llm_sdk/hub/hardware_inspect.py", line 198, in device_mem
    mem_data = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/pynvml/nvml.py", line 2440, in nvmlDeviceGetMemoryInfo
    _nvmlCheckReturn(ret)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NoPermission: Insufficient Permissions

With the image nvcr.io/nim/meta/llama-3.1-70b-instruct:1.8 the container starts, but the OOM killer terminates it after some time.

Hi @ok111, I don't believe you have enough GPU memory to run the 70b model. It requires roughly 2 x 70B bytes to run at fp16 (since A100 doesn't support fp8), which is roughly 140GB. Your nvidia-smi output shows six 20GiB MIG slices, which is roughly 120GB; this would explain the OOM issue with the 1.8 container.
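Back-of-envelope, counting weights only and ignoring KV cache and activations:

# python3 -c "print(70e9 * 2 / 1e9)"   # ~140 GB of fp16 weights
# python3 -c "print(6 * 20)"           # ~120 GiB across the six 20 GiB MIG slices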

The pynvml issue likely has to do with the use of MIG. NIM is not supported on MIG GPUs, but we might have to do some more digging to determine the exact issue. Can you try running a smaller model (like one of the 8b models) and see what happens? What about different versions of the 8b model?
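If it helps with the digging, something like the following should isolate the failing NVML call inside the same image. This is only a rough sketch: the venv interpreter path is guessed from your traceback, and the flags may need adjusting for your setup.

# docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
    --entrypoint /opt/nim/llm/.venv/bin/python3 \
    nvcr.io/nim/meta/llama-3.1-70b-instruct-pb24h2:1.3.4 \
    -c 'import pynvml; pynvml.nvmlInit(); h = pynvml.nvmlDeviceGetHandleByIndex(0); print(pynvml.nvmlDeviceGetMemoryInfo(h, version=pynvml.nvmlMemory_v2))'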