I have been working to run NIM directly on my system with a single RTX 4090. After some initial issues getting an API token to work, I can now authenticate and pull the models. However, the container detects 0 compatible profiles and lists my GPU as non-free. Has anyone successfully run NIM models natively on a PC with a single 4090? It has been a days-long challenge and I'm still not quite there.
{USER REDACTED}:~$ export NGC_API_KEY={REDACTED}
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus device=0 \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
NVIDIA Agreements | Enterprise Software | Product Specific Terms for AI Product.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2024-06-10 08:00:16,668 [INFO] PyTorch version 2.2.2 available.
2024-06-10 08:00:17,046 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-06-10 08:00:17,046 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-06-10 08:00:17,117 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 06-10 08:00:17.664 api_server.py:489] NIM LLM API version 1.0.0
INFO 06-10 08:00:17.665 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 06-10 08:00:17.665 ngc_profile.py:219] Detected 0 compatible profile(s).
INFO 06-10 08:00:17.665 ngc_profile.py:221] Detected additional 1 compatible profile(s) that are currently not runnable due to low free GPU memory.
ERROR 06-10 08:00:17.665 utils.py:21] Could not find a profile that is currently runnable with the detected hardware. Please check the system information below and make sure you have enough free GPUs.
SYSTEM INFO
- Free GPUs:
- Non-free GPUs:
- [2684:10de] (0) NVIDIA GeForce RTX 4090 [current utilization: 10%]
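The log lists the 4090 under "Non-free GPUs" at 10% utilization, which suggests NIM is rejecting the card because something else is already using it (a desktop session like Xorg/Wayland is a common culprit on a single-GPU machine). A quick way to check, assuming `nvidia-smi` is installed on the host:

```shell
# Show all processes (graphics and compute) currently attached to the GPU;
# NIM only selects profiles for GPUs it considers free of other workloads.
nvidia-smi

# VRAM summary: total / used / free, as a quick sanity check that enough
# memory is actually available for the selected model profile.
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
```

If a display server or another application shows up in the process list, freeing the GPU (e.g. running headless or moving the display to integrated graphics) may let NIM detect a runnable profile. This is a diagnostic sketch, not a confirmed fix for this specific error.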