According to the Support Matrix in the NVIDIA Docs, this GPU should be able to run 7B models. But the NIM reports that there are no compatible profiles.
lsb_release -a
Ubuntu 22.04.4 LTS
docker --version
Docker version 27.2.0, build 3ab4256
apt list --installed 2>&1|grep nvidia
libnvidia-container-tools/unknown,now 1.16.1-1 amd64 [installed,automatic]
libnvidia-container1/unknown,now 1.16.1-1 amd64 [installed,automatic]
nvidia-container-toolkit-base/unknown,now 1.16.1-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.16.1-1 amd64 [installed]
nvidia-smi
Wed Sep 11 12:29:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40-48C Off | 00000000:02:00.0 Off | 0 |
| N/A N/A P8 N/A / N/A | 1MiB / 49152MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
export LOCAL_NIM_CACHE=~/.cache/nim3
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 "$LOCAL_NIM_CACHE"
export NGC_API_KEY=nvapi-E...
docker container run -it \
--rm \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest
===========================================
== NVIDIA Inference Microservice LLM NIM ==
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).
2024-09-11 12:25:16,891 [INFO] PyTorch version 2.2.2 available.
2024-09-11 12:25:17,611 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-09-11 12:25:17,611 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-09-11 12:25:17,733 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 09-11 12:25:18.733 api_server.py:489] NIM LLM API version 1.0.0
INFO 09-11 12:25:18.735 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 09-11 12:25:18.735 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 09-11 12:25:18.735 ngc_profile.py:221] Detected additional 3 compatible profile(s) that are currently not runnable due to low free GPU memory.
ERROR 09-11 12:25:18.735 utils.py:21] Could not find a profile that is currently runnable with the detected hardware. Please check the system information below and make sure you have enough free GPUs.
SYSTEM INFO
- Free GPUs:
- Non-free GPUs:
- [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]
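Note that the NIM reports the GPU at 11% utilization even though nvidia-smi shows 0% and no running processes. A couple of standard queries can confirm whether anything else is holding the device (just a sketch, nothing NIM-specific):

# list any compute processes the driver knows about
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# show which processes have the NVIDIA device files open (fuser is from psmisc)
sudo fuser -v /dev/nvidia*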
Here are the available profiles:
docker container run -it \
--rm \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).
SYSTEM INFO
- Free GPUs:
- Non-free GPUs:
- [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]
MODEL PROFILES
- [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]
- Compatible with system and runnable:
- Compatible with system but not runnable due to low GPU free memory
- cc18942f40e770aa27a0b02c1f5bf1458a6fedd10a1ed377630d30d71a1b36db (tensorrt_llm-l40s-fp8-tp1-throughput)
- 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
- 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
- With LoRA support:
- eb445d1e451ed3987ca36da9be6bb4cdd41e498344cbf477a1600198753883ff (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
- 114fc68ad2c150e37eb03a911152f342e4e7423d5efb769393d30fa0b0cd1f9e (vllm-fp16-tp1-lora)
- Incompatible with system:
- 48004baf4f45ca177aa94abfd3c5c54858808ad728914b1626c3cf038ea85bc4 (tensorrt_llm-h100-fp8-tp2-latency)
- 5c17c27186b232e834aee9c61d1f5db388874da40053d70b84fd1386421ff577 (tensorrt_llm-l40s-fp8-tp2-latency)
- 08ab4363f225c19e3785b58408fa4dcac472459cca1febcfaffb43f873557e87 (tensorrt_llm-h100-fp8-tp1-throughput)
- dea9af90d5311ff2d651db8c16f752d014053d3b1c550474cbeda241f81c96bd (tensorrt_llm-a100-fp16-tp2-latency)
- 6064ab4c33a1c6da8058422b8cb0347e72141d203c77ba309ce5c5533f548188 (tensorrt_llm-h100-fp16-tp2-latency)
- ef22c7cecbcf2c8b3889bd58a48095e47a8cc0394d221acda1b4087b46c6f3e9 (tensorrt_llm-l40s-fp16-tp2-latency)
- c79561a74f97b157de12066b7a137702a4b09f71f4273ff747efe060881fca92 (tensorrt_llm-a100-fp16-tp1-throughput)
- 8833b9eba1bd4fbed4f764e64797227adca32e3c1f630c2722a8a52fee2fd1fa (tensorrt_llm-h100-fp16-tp1-throughput)
- 7387979dae9c209b33010e5da9aae4a94f75d928639ba462201e88a5dd4ac185 (vllm-fp16-tp2)
- 2c57f0135f9c6de0c556ba37f43f55f6a6c0a25fe0506df73e189aedfbd8b333 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
- 8f9730e45a88fb2ac16ce2ce21d7460479da1fd8747ba32d2b92fc4f6140ba83 (tensorrt_llm-h100-fp16-tp1-throughput-lora)
- 5797a519e300612f87f8a4a50a496a840fa747f7801b2dcd0cc9a3b4b949dd92 (vllm-fp16-tp2-lora)
If I add the following two options to the docker run command, the NIM will start. However, it stops responding after an unpredictable amount of time, and it sometimes hangs during startup around the point where about 18GB of GPU memory has been allocated. I can tell by watching nvidia-smi: when it hangs, the reported memory usage is usually stuck at roughly the same value (see the monitoring sketch after the two options).
-e NIM_MANIFEST_ALLOW_UNSAFE=1 \
-e NIM_MODEL_PROFILE=95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a \
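A simple way to log what nvidia-smi shows during startup, to pin down where the allocation stalls (sketch only; it just polls the driver once per second):

# log used GPU memory and utilization once per second with timestamps
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu --format=csv,noheader -l 1 | tee nim-startup-mem.log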
===========================================
== NVIDIA Inference Microservice LLM NIM ==
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).
2024-09-10 22:30:39,818 [INFO] PyTorch version 2.2.2 available.
2024-09-10 22:30:40,342 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-09-10 22:30:40,342 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-09-10 22:30:40,445 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 09-10 22:30:40.879 api_server.py:489] NIM LLM API version 1.0.0
INFO 09-10 22:30:40.881 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 09-10 22:30:40.881 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 09-10 22:30:40.881 ngc_profile.py:221] Detected additional 3 compatible profile(s) that are currently not runnable due to low free GPU memory.
INFO 09-10 22:30:40.881 ngc_injector.py:106] Valid profile: 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput) on GPUs
INFO 09-10 22:30:40.881 ngc_injector.py:141] Selected profile: 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: pp: 1
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: gpu_device: 26b5:10de
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: profile: throughput
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: tp: 1
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: gpu: L40S
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: llm_engine: tensorrt_llm
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 09-10 22:30:41.750 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 09-10 22:30:46.250 ngc_injector.py:172] Model workspace is now ready. It took 4.501 seconds
INFO 09-10 22:30:46.254 async_trtllm_engine.py:74] Initializing an LLM engine (v1.0.0) with config: model='/tmp/mistralai--mistral-7b-instruct-v0.3-_fpx2l7t', speculative_config=None, tokenizer='/tmp/mistralai--mistral-7b-instruct-v0.3-_fpx2l7t', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 09-10 22:30:46.261 logging.py:329] You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
INFO 09-10 22:30:46.561 utils.py:201] Using 0 bytes of gpu memory for PEFT cache
INFO 09-10 22:30:46.561 utils.py:207] Engine size in bytes 14534527988
INFO 09-10 22:30:46.565 utils.py:211] available device memory 44047466496
INFO 09-10 22:30:46.566 utils.py:218] Setting free_gpu_memory_fraction to 0.9
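For detecting when it stops responding, the standard NIM health endpoint plus a minimal completion request work as probes. The model name in the request body is an assumption; /v1/models reports the exact registered name:

# readiness probe exposed by the NIM
curl -s http://localhost:8000/v1/health/ready
# list the registered model name(s)
curl -s http://localhost:8000/v1/models
# minimal chat completion to confirm inference still works (model name assumed)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/mistral-7b-instruct-v0.3", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 16}'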