According to the Support Matrix in the NVIDIA Docs, this GPU should be able to run 7B models. But the NIM reports that there are no compatible profiles.
lsb_release -a
Ubuntu 22.04.4 LTS
docker --version
Docker version 27.2.0, build 3ab4256
apt list --installed 2>&1|grep nvidia
libnvidia-container-tools/unknown,now 1.16.1-1 amd64 [installed,automatic]
libnvidia-container1/unknown,now 1.16.1-1 amd64 [installed,automatic]
nvidia-container-toolkit-base/unknown,now 1.16.1-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.16.1-1 amd64 [installed]
nvidia-smi
Wed Sep 11 12:29:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40-48C Off | 00000000:02:00.0 Off | 0 |
| N/A N/A P8 N/A / N/A | 1MiB / 49152MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
export LOCAL_NIM_CACHE=~/.cache/nim3
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 "$LOCAL_NIM_CACHE"
export NGC_API_KEY=nvapi-E...
docker container run -it \
--rm \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest
===========================================
== NVIDIA Inference Microservice LLM NIM ==
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).
2024-09-11 12:25:16,891 [INFO] PyTorch version 2.2.2 available.
2024-09-11 12:25:17,611 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-09-11 12:25:17,611 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-09-11 12:25:17,733 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 09-11 12:25:18.733 api_server.py:489] NIM LLM API version 1.0.0
INFO 09-11 12:25:18.735 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 09-11 12:25:18.735 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 09-11 12:25:18.735 ngc_profile.py:221] Detected additional 3 compatible profile(s) that are currently not runnable due to low free GPU memory.
ERROR 09-11 12:25:18.735 utils.py:21] Could not find a profile that is currently runnable with the detected hardware. Please check the system information below and make sure you have enough free GPUs.
SYSTEM INFO
- Free GPUs:
- Non-free GPUs:
- [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]
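Note that the NIM reports the GPU at 11% utilization even though nvidia-smi shows 0% and no running processes. A couple of standard queries can confirm whether anything else is holding the device (just a sketch, nothing NIM-specific):

# list any compute processes the driver knows about
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# show which processes have the NVIDIA device files open (fuser is from psmisc)
sudo fuser -v /dev/nvidia*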
Here are the available profiles:
docker container run -it \
--rm \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
nvcr.io/nim/mistralai/mistral-7b-instruct-v03:latest list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).
SYSTEM INFO
- Free GPUs:
- Non-free GPUs:
- [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]
MODEL PROFILES
- [26b5:10de] (0) NVIDIA L40-48C (L40S) [current utilization: 11%]
- Compatible with system and runnable:
- Compatible with system but not runnable due to low GPU free memory
- cc18942f40e770aa27a0b02c1f5bf1458a6fedd10a1ed377630d30d71a1b36db (tensorrt_llm-l40s-fp8-tp1-throughput)
- 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
- 7680b65db3dde6ebb3cb045e9176426b32d2e14023e61f1cd2137216dd1dc789 (vllm-fp16-tp1)
- With LoRA support:
- eb445d1e451ed3987ca36da9be6bb4cdd41e498344cbf477a1600198753883ff (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
- 114fc68ad2c150e37eb03a911152f342e4e7423d5efb769393d30fa0b0cd1f9e (vllm-fp16-tp1-lora)
- Incompatible with system:
- 48004baf4f45ca177aa94abfd3c5c54858808ad728914b1626c3cf038ea85bc4 (tensorrt_llm-h100-fp8-tp2-latency)
- 5c17c27186b232e834aee9c61d1f5db388874da40053d70b84fd1386421ff577 (tensorrt_llm-l40s-fp8-tp2-latency)
- 08ab4363f225c19e3785b58408fa4dcac472459cca1febcfaffb43f873557e87 (tensorrt_llm-h100-fp8-tp1-throughput)
- dea9af90d5311ff2d651db8c16f752d014053d3b1c550474cbeda241f81c96bd (tensorrt_llm-a100-fp16-tp2-latency)
- 6064ab4c33a1c6da8058422b8cb0347e72141d203c77ba309ce5c5533f548188 (tensorrt_llm-h100-fp16-tp2-latency)
- ef22c7cecbcf2c8b3889bd58a48095e47a8cc0394d221acda1b4087b46c6f3e9 (tensorrt_llm-l40s-fp16-tp2-latency)
- c79561a74f97b157de12066b7a137702a4b09f71f4273ff747efe060881fca92 (tensorrt_llm-a100-fp16-tp1-throughput)
- 8833b9eba1bd4fbed4f764e64797227adca32e3c1f630c2722a8a52fee2fd1fa (tensorrt_llm-h100-fp16-tp1-throughput)
- 7387979dae9c209b33010e5da9aae4a94f75d928639ba462201e88a5dd4ac185 (vllm-fp16-tp2)
- 2c57f0135f9c6de0c556ba37f43f55f6a6c0a25fe0506df73e189aedfbd8b333 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
- 8f9730e45a88fb2ac16ce2ce21d7460479da1fd8747ba32d2b92fc4f6140ba83 (tensorrt_llm-h100-fp16-tp1-throughput-lora)
- 5797a519e300612f87f8a4a50a496a840fa747f7801b2dcd0cc9a3b4b949dd92 (vllm-fp16-tp2-lora)
If I add the following two options to the docker run command, the NIM will start. However, it stops responding after an unpredictable amount of time, and it sometimes hangs during startup around the point where about 18GB of GPU memory has been allocated. I can tell by watching nvidia-smi: when it hangs, the reported memory usage is usually stuck at roughly the same value (see the monitoring sketch after the two options).
-e NIM_MANIFEST_ALLOW_UNSAFE=1 \
-e NIM_MODEL_PROFILE=95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a \
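A simple way to log what nvidia-smi shows during startup, to pin down where the allocation stalls (sketch only; it just polls the driver once per second):

# log used GPU memory and utilization once per second with timestamps
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu --format=csv,noheader -l 1 | tee nim-startup-mem.log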
===========================================
== NVIDIA Inference Microservice LLM NIM ==
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/mistralai/mistral-7b-instruct-v03
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).
2024-09-10 22:30:39,818 [INFO] PyTorch version 2.2.2 available.
2024-09-10 22:30:40,342 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-09-10 22:30:40,342 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-09-10 22:30:40,445 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 09-10 22:30:40.879 api_server.py:489] NIM LLM API version 1.0.0
INFO 09-10 22:30:40.881 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 09-10 22:30:40.881 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 09-10 22:30:40.881 ngc_profile.py:221] Detected additional 3 compatible profile(s) that are currently not runnable due to low free GPU memory.
INFO 09-10 22:30:40.881 ngc_injector.py:106] Valid profile: 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput) on GPUs
INFO 09-10 22:30:40.881 ngc_injector.py:141] Selected profile: 95f764b13dca98173068ad7dd9184098e18a04ad803722540a911d35a599378a (tensorrt_llm-l40s-fp16-tp1-throughput)
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: pp: 1
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: gpu_device: 26b5:10de
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: profile: throughput
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: tp: 1
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: gpu: L40S
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: llm_engine: tensorrt_llm
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 09-10 22:30:41.749 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 09-10 22:30:41.750 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 09-10 22:30:46.250 ngc_injector.py:172] Model workspace is now ready. It took 4.501 seconds
INFO 09-10 22:30:46.254 async_trtllm_engine.py:74] Initializing an LLM engine (v1.0.0) with config: model='/tmp/mistralai--mistral-7b-instruct-v0.3-_fpx2l7t', speculative_config=None, tokenizer='/tmp/mistralai--mistral-7b-instruct-v0.3-_fpx2l7t', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 09-10 22:30:46.261 logging.py:329] You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
INFO 09-10 22:30:46.561 utils.py:201] Using 0 bytes of gpu memory for PEFT cache
INFO 09-10 22:30:46.561 utils.py:207] Engine size in bytes 14534527988
INFO 09-10 22:30:46.565 utils.py:211] available device memory 44047466496
INFO 09-10 22:30:46.566 utils.py:218] Setting free_gpu_memory_fraction to 0.9
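For detecting when it stops responding, the standard NIM health endpoint plus a minimal completion request work as probes. The model name in the request body is an assumption; /v1/models reports the exact registered name:

# readiness probe exposed by the NIM
curl -s http://localhost:8000/v1/health/ready
# list the registered model name(s)
curl -s http://localhost:8000/v1/models
# minimal chat completion to confirm inference still works (model name assumed)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/mistral-7b-instruct-v0.3", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 16}'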