Help needed: A40 in VMware ESXi 7.0u2

Hi guys,

I’m trying to use an NVIDIA A40 for XenDesktop on VMware ESXi 7.0u2. I’ve run into an issue: a VM with a vGPU profile will not start. Can anyone help? Thanks!

Error when starting a VM with a vGPU profile (I tried different VMs and vGPU profiles, with no luck):

Task Name: Power On virtual machine
Status: Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_a40-6q'. Failed to start the virtual machine. Module DevicePowerOn power on failed.
Initiator: administrator@vsphere.local
Error stack:
Module DevicePowerOn power on failed.
Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_a40-6q'.

Environment

Server: Dell R740 with newest BIOS 2.11.2
ESXi: 7.0u2, build 17867351

nvidia-smi output

Tue Aug 17 02:17:03 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:AF:00.0 Off |                  Off |
|  0%   50C    P8    37W / 300W |      0MiB / 48687MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

ECC memory

I searched the internet, and many posts say the error “Could not initialize plugin ‘libnvidia-vgx.so’” is caused by ECC memory being enabled. However, I’ve confirmed that ECC memory is disabled in the nvidia-smi -q output:

==============NVSMI LOG==============

Timestamp : Tue Aug 17 02:36:02 2021
Driver Version : 470.63
CUDA Version : Not Found

Attached GPUs : 1
GPU 00000000:AF:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1320921024371
GPU UUID : GPU-2aefb9b1-5506-f46e-b53f-70ea9f2db5d8
Minor Number : 0
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0xaf00
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xAF
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:AF:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 48687 MiB
Used : 0 MiB
Free : 48687 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
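
For anyone who wants to script the same ECC check across several hosts, here is a minimal Python sketch that pulls the "Ecc Mode" values out of saved `nvidia-smi -q` text. The function name `ecc_modes` and the sample string are made up for illustration, and the parser assumes the field layout shown in the output above (a "GPU <bus id>" line, then an "Ecc Mode" line followed by "Current" and "Pending"). It only reports what the driver prints; it says nothing about BIOS settings.

```python
import re

# Matches a per-GPU header line such as "GPU 00000000:AF:00.0".
GPU_RE = re.compile(
    r"^\s*GPU\s+([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])\s*$"
)
# Matches "Current : <value>" or "Pending : <value>".
FIELD_RE = re.compile(r"^\s*(Current|Pending)\s*:\s*(\S.*?)\s*$")


def ecc_modes(text):
    """Return {bus_id: {"Current": ..., "Pending": ...}} for each GPU found."""
    result = {}
    gpu = None
    in_ecc = False
    for line in text.splitlines():
        gpu_match = GPU_RE.match(line)
        if gpu_match:
            gpu = gpu_match.group(1)
            in_ecc = False
            continue
        if line.strip() == "Ecc Mode":
            in_ecc = True
            continue
        field = FIELD_RE.match(line)
        if in_ecc and gpu and field:
            result.setdefault(gpu, {})[field.group(1)] = field.group(2)
        elif in_ecc and not field:
            in_ecc = False  # any other line ends the "Ecc Mode" block
    return result


# Hypothetical sample, trimmed from the output in this post.
SAMPLE = """GPU 00000000:AF:00.0
Product Name : NVIDIA A40
MIG Mode
Current : N/A
Pending : N/A
Ecc Mode
Current : Disabled
Pending : Disabled
"""

if __name__ == "__main__":
    for bus_id, modes in ecc_modes(SAMPLE).items():
        print(bus_id, modes)
```

The "MIG Mode" block also has Current/Pending lines, which is why the sketch only records values while it is inside the "Ecc Mode" block.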

Error message in VM log

I see an error “NVOS status 0x17” in the log. Does anyone know what it means?

2021-08-16T08:17:29.087Z| vmx| | A000: ConfigDB: Unsetting "vmiop.guestVgpuVersion"
2021-08-16T08:17:29.111Z| vmx| | I005: VMIOP: Registered device 0000:af:00.0
2021-08-16T08:17:29.123Z| vmx| | A000: ConfigDB: Setting pciPassthru0.pgpu = "2235145A0606060606060600000002"
2021-08-16T08:17:29.123Z| vmx| | I005: VMIOP: Enabling checkpoint support
2021-08-16T08:17:29.123Z| vmx| | I005: VMIOP: Initializing plugin vmiop-display
2021-08-16T08:17:29.125Z| vmx| | E002: vmiop_log: NVOS status 0x17
2021-08-16T08:17:29.125Z| vmx| | E002: vmiop_log: Assertion Failed at 0xf8fb34b3:97
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: 16 frames returned by backtrace
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv005327vgpu+0x35) [0x58f8ffb615]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x7f6f8) [0x58f8fb76f8]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x7b4b3) [0x58f8fb34b3]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x99b97) [0x58f8fd1b97]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x9caeb) [0x58f8fd4aeb]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libvmx-vmiop.so(+0x91f4) [0x58f8b2f1f4]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x3adf98) [0x58b0d7ff98]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dc924) [0x58b0cae924]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dc45c) [0x58b0cae45c]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dd557) [0x58b0caf557]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2e82bb) [0x58b0cba2bb]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x25a0c5) [0x58b0c2c0c5]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x25a8e2) [0x58b0c2c8e2]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x24e741) [0x58b0c20741]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /lib64/libc.so.6(__libc_start_main+0xed) [0x58f4082b2d]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x24f115) [0x58b0c21115]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: (0x0): Initialization: Failed to alloc host vgpu device handle error 1
2021-08-16T08:17:29.127Z| vmx| | E002: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (unable to setup host connection state)
2021-08-16T08:17:29.127Z| vmx| | E002: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
2021-08-16T08:17:29.127Z| vmx| | E002: vmiop_log: display_init failed for inst: 0
2021-08-16T08:17:29.127Z| vmx| | E002: VMIOP: Plugin vmiop-display initialization failed: 1
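
The backtrace frames make the log hard to read, so here is a small Python sketch that keeps only the error-level lines and drops the stack frames. It assumes the line layout seen in the excerpt above (timestamp, `| vmx|`, then a level code such as `E002:` followed by the message); the function name `vgpu_errors` and the sample string are hypothetical, and other ESXi builds may format vmware.log differently.

```python
import re

# "<timestamp>| <thread>| ... <level>: <message>", as in the excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\S+?)\|\s*(?P<thread>[^|]+)\|.*?(?P<level>[A-Z]\d{3}): (?P<msg>.*)$"
)
# Backtrace frames end with an address in brackets, e.g. "[0x58f8fb76f8]".
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\]\s*$")


def vgpu_errors(log_text):
    """Return (timestamp, message) pairs for error lines, skipping stack frames."""
    errors = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match or not match.group("level").startswith("E"):
            continue  # not a log line, or not error level
        message = match.group("msg")
        if FRAME_RE.search(message):
            continue  # backtrace frame
        errors.append((match.group("ts"), message))
    return errors


# Hypothetical sample built from lines in the excerpt above.
SAMPLE = """2021-08-16T08:17:29.123Z| vmx| | I005: VMIOP: Initializing plugin vmiop-display
2021-08-16T08:17:29.125Z| vmx| | E002: vmiop_log: NVOS status 0x17
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x3adf98) [0x58b0d7ff98]
2021-08-16T08:17:29.127Z| vmx| | E002: VMIOP: Plugin vmiop-display initialization failed: 1
"""

if __name__ == "__main__":
    for timestamp, message in vgpu_errors(SAMPLE):
        print(timestamp, message)
```

Run against the full excerpt, this would leave the NVOS status line, the assertion, and the "Failed to alloc host vgpu device handle" and init_device_instance messages, which are the lines worth quoting when asking for help.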

Do you have vCenter also on 7.0.2?

Regards Simon

Yes, vCenter is on 7.0.2, the newest version.

How much system memory did you assign?

The server has 256 GB of memory. I assigned 8 GB to the VM.

Please try increasing the VM hardware version. I’m not sure whether version 11 works with vGPU 13.

I tried a VM with the newest hardware version, 19, but no luck.

Hi Simon,

We also borrowed a V100 for comparison. After replacing the A40 with the V100 in the same server, the VM started successfully with no errors. I really have no clue now, please help. Thanks!

Which server model do you use for testing? It sounds like the A40 is not working properly with the hardware. Ampere GPUs require SR-IOV to be enabled in the BIOS.

Thank you so much, Simon! After enabling SR-IOV in the BIOS of our R740 server, the VM started successfully and works fine now. I really appreciate your help!