L40S unavailable when other GPUs are present on ESXi host

Good afternoon,

Running into a strange issue with vGPU on ESXi. Current versions:
Hypervisor: ESXi 8.0.3 24280767
vGPU Drivers: 16.7
Server: Supermicro X11DPG-SN, 2x Xeon 8272CL
GPUs: L40S, A16 and P40

Problem:
I have 4 hosts with identical configuration, each with an A16 and a P40 GPU. I recently added an L40S to each server, but this makes the vGPU driver unstable and no VMs are able to use any GPU resources. The A16 and P40 still show up as graphics devices normally, but the L40S appears with “0” VRAM. Trying to launch a VM on the P40 or A16 results in a “device is not available on the host” error, and no L40S vGPU profiles are visible on the host at all.
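
For reference, this is roughly how I am checking GPU state on the host (assuming the standard NVIDIA vGPU host driver is installed; adjust for your environment):

esxcli graphics device list    # lists the GPUs ESXi sees as graphics devices
nvidia-smi                     # host driver view; this is where the L40S reports 0 MiB
nvidia-smi vgpu -s             # supported vGPU types per GPU; nothing is listed for the L40S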

If I remove the A16 and P40 (leaving the L40S by itself), everything works perfectly fine. If I switch the P40 and A16 to PCIe passthrough, the L40S also works fine. If either GPU is added back to the host, the L40S stops working.

The only error I am able to find is the vGPU driver restarting over and over with the following messages:

2024-10-14T23:17:06.579Z In(182) vmkernel: cpu82:2099624)NVRM: GPU 0000:db:00.0: RmInitAdapter failed! (0x25:0x56:1468)
2024-10-14T23:17:06.579Z In(182) vmkernel: cpu82:2099624)NVRM: rm_init_adapter failed for device 1
2024-10-14T23:17:06.784Z In(182) vmkernel: cpu82:2099624)NVRM: GPU at 0000:db:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
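
For reference, these are straight from the host's vmkernel log (the path may differ depending on your syslog setup); grepping for NVRM shows the same RmInitAdapter failure repeating:

grep NVRM /var/run/log/vmkernel.log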

I have tried a few different ideas based on some research here. The only thing that made a difference was setting “NVreg_EnableGpuFirmware=0”. With that set, the L40S is visible on the host without any errors, and the P40 and A16 both work fine. However, launching a VM using the L40S results in a strange issue where the host allocates the GPU to the VM, but the VM is unable to initialize the GPU and never loads the driver.
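
For anyone wanting to repeat that test, I set the parameter on the nvidia module and rebooted the host. Roughly the following (this is the module-parameter approach described in the vGPU docs for disabling the GPU firmware; double-check the exact syntax against your driver version):

esxcli system module parameters set -m nvidia -p "NVreg_EnableGpuFirmware=0"
esxcli system module parameters list -m nvidia    # verify the parameter is set
reboot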

I have attached the bug report with everything left “at defaults” and would love some assistance drilling down into this issue.
nvidia-bug-report.log (3.0 MB)

This is simply not going to work, nor is it supported. There was a technical change between Ampere and Ada that does not allow these generations to run in parallel.

Well, that definitely explains what I am seeing. Do you have any documentation I can reference that outlines this? I'll need to explain it and figure out how to move forward.

Thank you for the reply!

Nothing I can share publicly. We changed from a software-based RM to a hardware-based RM (a RISC-V chip on the GPU) starting with Ada GPUs. Hope this helps to clarify that either a software-based or a hardware-based RM is possible, but not both at the same time.

Gotcha, I will pass this on. Thank you again for taking the time to reply!
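
In case it helps anyone else who lands on this thread: a quick way to see which RM a GPU is running under is to query the host driver with nvidia-smi. On recent drivers the detailed query output includes a GSP firmware field, which reads N/A when the firmware-based RM is disabled (the exact field name may vary by driver version):

nvidia-smi -q | grep -i "gsp"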

I have a DL380 server and installed one L40S in it. ESXi recognized it as an L40S, but when I try to get information with nvidia-smi it says it can't communicate with the GPU.
What is the problem?