How to fix "0 compatible profiles" for L40S with the mistral-7b-instruct-v03 NIM?

According to the Support Matrix (NVIDIA Docs), the GPU should be compatible with running 7B models, but the NIM reports that there are no compatible profiles.

lsb_release -a
Ubuntu 22.04.4 LTS

docker --version
Docker version 27.2.0, build 3ab4256

apt list --installed 2>&1|grep nvidia
libnvidia-container-tools/unknown,now 1.16.1-1 amd64 [installed,automatic]
libnvidia-container1/unknown,now 1.16.1-1 amd64 [installed,automatic]
nvidia-container-toolkit-base/unknown,now 1.16.1-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.16.1-1 amd64 [installed]
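
As a sanity check that containers can reach the GPU (the usual NVIDIA Container Toolkit smoke test), running nvidia-smi inside a throwaway container works:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi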

nvidia-smi
Wed Sep 11 12:29:32 2024

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40-48C                 Off |   00000000:02:00.0 Off |                    0 |
| N/A   N/A    P8             N/A /  N/A  |       1MiB /  49152MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
export LOCAL_NIM_CACHE=~/.cache/nim3
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 "$LOCAL_NIM_CACHE"
export NGC_API_KEY=nvapi-E...
docker container run -it \
  --rm \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).

2024-09-11 12:25:16,891 [INFO] PyTorch version 2.2.2 available.
2024-09-11 12:25:17,611 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-09-11 12:25:17,611 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-09-11 12:25:17,733 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 09-11 12:25:18.733 api_server.py:489] NIM LLM API version 1.0.0
INFO 09-11 12:25:18.735 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 09-11 12:25:18.735 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 09-11 12:25:18.735 ngc_profile.py:221] Detected additional 3 compatible profile(s) that are currently not runnable due to low free GPU memory.
ERROR 09-11 12:25:18.735 utils.py:21] Could not find a profile that is currently runnable with the detected hardware. Please check the system information below and make sure you have enough free GPUs.
SYSTEM INFO

  • Free GPUs:
  • Non-free GPUs:
    • [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]

Here are the available profiles, listed with the list-model-profiles command:

docker container run -it \
  --rm \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest list-model-profiles

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).

SYSTEM INFO

  • Free GPUs:
  • Non-free GPUs:
    • [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]

MODEL PROFILES
  • Compatible with system and runnable:
  • Compatible with system but not runnable due to low GPU free memory
    • cc18942f40e770aa27a0b02c1f5bf1458a6fedd10a1ed377630d30d71a1b36db (tensorrt_llm-l40s-fp8-tp1-throughput)
    • 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
    • 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
    • With LoRA support:
      • eb445d1e451ed3987ca36da9be6bb4cdd41e498344cbf477a1600198753883ff (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
      • 114fc68ad2c150e37eb03a911152f342e4e7423d5efb769393d30fa0b0cd1f9e (vllm-fp16-tp1-lora)
  • Incompatible with system:
    • 48004baf4f45ca177aa94abfd3c5c54858808ad728914b1626c3cf038ea85bc4 (tensorrt_llm-h100-fp8-tp2-latency)
    • 5c17c27186b232e834aee9c61d1f5db388874da40053d70b84fd1386421ff577 (tensorrt_llm-l40s-fp8-tp2-latency)
    • 08ab4363f225c19e3785b58408fa4dcac472459cca1febcfaffb43f873557e87 (tensorrt_llm-h100-fp8-tp1-throughput)
    • dea9af90d5311ff2d651db8c16f752d014053d3b1c550474cbeda241f81c96bd (tensorrt_llm-a100-fp16-tp2-latency)
    • 6064ab4c33a1c6da8058422b8cb0347e72141d203c77ba309ce5c5533f548188 (tensorrt_llm-h100-fp16-tp2-latency)
    • ef22c7cecbcf2c8b3889bd58a48095e47a8cc0394d221acda1b4087b46c6f3e9 (tensorrt_llm-l40s-fp16-tp2-latency)
    • c79561a74f97b157de12066b7a137702a4b09f71f4273ff747efe060881fca92 (tensorrt_llm-a100-fp16-tp1-throughput)
    • 8833b9eba1bd4fbed4f764e64797227adca32e3c1f630c2722a8a52fee2fd1fa (tensorrt_llm-h100-fp16-tp1-throughput)
    • 7387979dae9c209b33010e5da9aae4a94f75d928639ba462201e88a5dd4ac185 (vllm-fp16-tp2)
    • 2c57f0135f9c6de0c556ba37f43f55f6a6c0a25fe0506df73e189aedfbd8b333 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
    • 8f9730e45a88fb2ac16ce2ce21d7460479da1fd8747ba32d2b92fc4f6140ba83 (tensorrt_llm-h100-fp16-tp1-throughput-lora)
    • 5797a519e300612f87f8a4a50a496a840fa747f7801b2dcd0cc9a3b4b949dd92 (vllm-fp16-tp2-lora)
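
For reference, a chosen profile can also be pre-fetched into the cache so restarts don't re-download the engine. This is a sketch assuming the download-to-cache utility that NIM 1.0 containers document alongside list-model-profiles; the profile ID is the L40S fp16 one from the list above:

docker container run -it \
  --rm \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest \
  download-to-cache --profiles 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a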

If I add these two options to the docker run, then it will start (a complete invocation is sketched below). But I have been having issues with it no longer responding after an unpredictable amount of time. It will also sometimes hang while starting the NIM, around the point where it has allocated about 18GB of GPU memory. I can tell by watching nvidia-smi: when it hangs, it is usually showing that same amount of allocated GPU memory.

  -e NIM_MANIFEST_ALLOW_UNSAFE=1 \
  -e NIM_MODEL_PROFILE=95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a \
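
Putting it together, the full command (the earlier run command plus the two overrides) looks like this:

docker container run -it \
  --rm \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -e NIM_MANIFEST_ALLOW_UNSAFE=1 \
  -e NIM_MODEL_PROFILE=95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest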

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).

2024-09-10 22:30:39,818 [INFO] PyTorch version 2.2.2 available.
2024-09-10 22:30:40,342 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-09-10 22:30:40,342 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-09-10 22:30:40,445 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 09-10 22:30:40.879 api_server.py:489] NIM LLM API version 1.0.0
INFO 09-10 22:30:40.881 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 09-10 22:30:40.881 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 09-10 22:30:40.881 ngc_profile.py:221] Detected additional 3 compatible profile(s) that are currently not runnable due to low free GPU memory.
INFO 09-10 22:30:40.881 ngc_injector.py:106] Valid profile: 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput) on GPUs
INFO 09-10 22:30:40.881 ngc_injector.py:141] Selected profile: 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: pp: 1
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: gpu_device: 26b5:10de
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: profile: throughput
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: tp: 1
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: gpu: L40S
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: llm_engine: tensorrt_llm
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 09-10 22:30:41.750 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 09-10 22:30:46.250 ngc_injector.py:172] Model workspace is now ready. It took 4.501 seconds
INFO 09-10 22:30:46.254 async_trtllm_engine.py:74] Initializing an LLM engine (v1.0.0) with config: model='/tmp/mistralai--mistral-7b-instruct-v0.3-_fpx2l7t', speculative_config=None, tokenizer='/tmp/mistralai--mistral-7b-instruct-v0.3-_fpx2l7t', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 09-10 22:30:46.261 logging.py:329] You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
INFO 09-10 22:30:46.561 utils.py:201] Using 0 bytes of gpu memory for PEFT cache
INFO 09-10 22:30:46.561 utils.py:207] Engine size in bytes 14534527988
INFO 09-10 22:30:46.565 utils.py:211] available device memory 44047466496
INFO 09-10 22:30:46.566 utils.py:218] Setting free_gpu_memory_fraction to 0.9
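
When it does make it past this point and finishes loading, I verify it is serving with the standard NIM health and OpenAI-compatible endpoints. A minimal check, assuming the usual NIM 1.0 routes (the model name below is what I'd expect this NIM to register; /v1/models shows the actual one):

curl -s http://localhost:8000/v1/health/ready

curl -s http://localhost:8000/v1/models

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/mistral-7b-instruct-v0.3", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'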

Hey @brooke.hedrick - NIM tries to deploy on free GPUs (i.e., utilization < 5%), and based on the logs you shared:

SYSTEM INFO

  • Free GPUs:
  • Non-free GPUs:
    • [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]

There was something else running on the GPU. I'd recommend taking a look to see whether any other processes are running on the GPU, and trying to remove them; a couple of quick checks are below.
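
For example, these are standard ways to see what is holding the GPU (fuser needs root):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
sudo fuser -v /dev/nvidia*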

That is good to know. Would nvidia-smi showing 1MiB/49152MiB used and 0% GPU utilization be consistent with the reported 11% utilization?

  1. I rebooted the server.
  2. docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    --NOTHING HERE--
  3. nvidia-smi - my timezone is CT with DST (UTC-5)
nvidia-smi
Wed Sep 11 17:42:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40-48C                 Off |   00000000:02:00.0 Off |                    0 |
| N/A   N/A    P8             N/A /  N/A  |       1MiB /  49152MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  4. Then tried to start the NIM.

Please note, this is on a VMware VM using vGPU. This is the only VM attached to the GPU.
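
In case it helps anyone compare setups: nvidia-smi -q in the guest reports a virtualization mode section, which shows whether the device is presented via vGPU or passthrough. A quick filter, assuming that section name:

nvidia-smi -q | grep -i -A 2 "virtualization"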

We have exactly the same issue. Any solution yet?

Hi @florian.boettcher1 ,

Short term, we have passed the GPU through to the VM instead of using vGPU, accepting the drawbacks of doing so. This worked fine. If you are willing to try a 12B version of the model, you have another option; I haven't had time to try the 12B model under vGPU myself yet.

My updates from support
Oct 11, 2024, 12:15 PM
Hi Brooke,
Options:
1. Passthrough works with mistral-7b-instruct-v03 for now (no update from NIM engineering on whether or when this model will get vGPU support).
2. There are some new model NIMs the team mentioned that will have vGPU support, e.g. Mistral NeMo; please give it a try.

Oct 1, 2024, 03:55 PM
The NIM team reports that Mistral-NeMo-12B-Instruct is ready to use.

Sep 28, 2024, 12:31 PM
vGPU support is in the 1.2 release.
The Llama 3.1 70B, Mixtral 8x7B, and Mixtral 8x22B models have been released.
Llama 3.1 8B will be released soon.


Thanks for the message, but we managed to get it fixed today. In summary, it was the pciPassthru settings and disabling ECC for the GPUs on ESXi; a sketch of what that involved is below.
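
Roughly, for anyone else hitting this: ECC can be toggled off with nvidia-smi (a reboot is required for it to take effect), and the pciPassthru entries go into the VM's advanced settings (.vmx). The MMIO size below is an assumption; it depends on total GPU memory:

# disable ECC on the GPU, then reboot
nvidia-smi -e 0

# VM advanced settings (.vmx) commonly used for large-BAR GPU passthrough
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"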

@florian.boettcher1 Are you using vGPU or just passing the GPU through directly to the VM? It sounds like the latter.

We are using vGPU because we need it for some use cases.