Help needed: A40 in VMware ESXi 7.0u2

Hi guys,

I’m trying to use an NVIDIA A40 for XenDesktop on VMware ESXi 7.0u2. I’ve run into an issue where a VM with a vGPU profile will not start. Can anyone help, please? Thanks!

Error when starting a VM with a vGPU profile (I tried different VMs and vGPU profiles, with no luck)

Task Name: Power On virtual machine
Status: Could not initialize plugin ‘libnvidia-vgx.so’ for vGPU ‘nvidia_a40-6q’. Failed to start the virtual machine. Module DevicePowerOn power on failed.
Initiator: administrator@vsphere.local
Error stack:
Module DevicePowerOn power on failed.
Could not initialize plugin ‘libnvidia-vgx.so’ for vGPU ‘nvidia_a40-6q’.

Environment

Server: Dell R740 with newest BIOS 2.11.2
ESXi: 7.0u2, build 17867351

nvidia-smi output

Tue Aug 17 02:17:03 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:AF:00.0 Off |                  Off |
|  0%   50C    P8    37W / 300W |      0MiB / 48687MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

ECC memory

I searched the internet, and many posts say the error “Could not initialize plugin ‘libnvidia-vgx.so’” is caused by ECC memory being enabled, but I’ve confirmed that ECC memory is disabled in the nvidia-smi -q output:

==============NVSMI LOG==============

Timestamp : Tue Aug 17 02:36:02 2021
Driver Version : 470.63
CUDA Version : Not Found

Attached GPUs : 1
GPU 00000000:AF:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1320921024371
GPU UUID : GPU-2aefb9b1-5506-f46e-b53f-70ea9f2db5d8
Minor Number : 0
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0xaf00
GPU Part Number : 900-2G133-0000-000
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xAF
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:AF:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 48687 MiB
Used : 0 MiB
Free : 48687 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
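
For anyone repeating this check on a saved dump, here is a minimal Python sketch that pulls the ECC and virtualization fields out of `nvidia-smi -q` text. The function name `parse_nvsmi_fields` and the "Section/Key" naming are assumptions made for illustration, not part of any NVIDIA tool. It treats the output as flat text, so each key is filed under the most recent header line, which is a simplification of the real nested layout.

```python
def parse_nvsmi_fields(text):
    """Map 'Section/Key' -> value for a flat `nvidia-smi -q` dump.

    Hypothetical helper: a line without ' : ' is treated as a section
    header, and every 'Key : Value' line is filed under the last header seen.
    """
    fields = {}
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("="):
            continue  # skip blank lines and the "====NVSMI LOG====" banner
        key, sep, value = line.partition(" : ")
        if sep:
            fields[f"{section}/{key.strip()}"] = value.strip()
        else:
            section = line  # header such as "Ecc Mode"
    return fields


# Sample lines copied from the output above.
sample = """\
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
Ecc Mode
Current : Disabled
Pending : Disabled
"""

info = parse_nvsmi_fields(sample)
print("ECC current:", info["Ecc Mode/Current"])
print("Host vGPU mode:", info["GPU Virtualization Mode/Host VGPU Mode"])
```

Note that "Host VGPU Mode : SR-IOV" only shows what the driver reports for the card. Judging by how this thread was resolved, it does not tell you whether SR-IOV is enabled in the server BIOS.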

Error message in VM log

I see an error “NVOS status 0x17” in the log. Does anyone know what it means?

2021-08-16T08:17:29.087Z| vmx| | A000: ConfigDB: Unsetting "vmiop.guestVgpuVersion"
2021-08-16T08:17:29.111Z| vmx| | I005: VMIOP: Registered device 0000:af:00.0
2021-08-16T08:17:29.123Z| vmx| | A000: ConfigDB: Setting pciPassthru0.pgpu = "2235145A0606060606060600000002"
2021-08-16T08:17:29.123Z| vmx| | I005: VMIOP: Enabling checkpoint support
2021-08-16T08:17:29.123Z| vmx| | I005: VMIOP: Initializing plugin vmiop-display
2021-08-16T08:17:29.125Z| vmx| | E002: vmiop_log: NVOS status 0x17
2021-08-16T08:17:29.125Z| vmx| | E002: vmiop_log: Assertion Failed at 0xf8fb34b3:97
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: 16 frames returned by backtrace
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv005327vgpu+0x35) [0x58f8ffb615]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x7f6f8) [0x58f8fb76f8]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x7b4b3) [0x58f8fb34b3]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x99b97) [0x58f8fd1b97]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x9caeb) [0x58f8fd4aeb]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libvmx-vmiop.so(+0x91f4) [0x58f8b2f1f4]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x3adf98) [0x58b0d7ff98]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dc924) [0x58b0cae924]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dc45c) [0x58b0cae45c]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dd557) [0x58b0caf557]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2e82bb) [0x58b0cba2bb]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x25a0c5) [0x58b0c2c0c5]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x25a8e2) [0x58b0c2c8e2]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x24e741) [0x58b0c20741]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /lib64/libc.so.6(__libc_start_main+0xed) [0x58f4082b2d]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /bin/vmx(+0x24f115) [0x58b0c21115]
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: (0x0): Initialization: Failed to alloc host vgpu device handle error 1
2021-08-16T08:17:29.127Z| vmx| | E002: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (unable to setup host connection state)
2021-08-16T08:17:29.127Z| vmx| | E002: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
2021-08-16T08:17:29.127Z| vmx| | E002: vmiop_log: display_init failed for inst: 0
2021-08-16T08:17:29.127Z| vmx| | E002: VMIOP: Plugin vmiop-display initialization failed: 1
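
The backtrace frames make a log like this hard to skim. Below is a rough Python sketch that keeps only the error lines and drops the frames. It assumes the `timestamp| thread| | LEVEL: message` layout shown above and assumes that a level code starting with `E` marks an error; both are inferred from this excerpt rather than from VMware documentation, and `vmiop_errors` is a made-up name.

```python
import re

# Layout inferred from the excerpt: "<timestamp>| <thread>| | <level>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>[^|]+)\|\s*(?P<thread>[^|]*)\|\s*\|\s*(?P<level>[A-Z]\d{3}):\s*(?P<msg>.*)$"
)
# Backtrace frames end with an address in brackets, e.g. "[0x58f8ffb615]".
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\]\s*$")


def vmiop_errors(log_text):
    """Return (timestamp, message) pairs for error lines, skipping backtrace frames."""
    errors = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or not match.group("level").startswith("E"):
            continue  # not in the expected layout, or not an error-level line
        message = match.group("msg")
        if FRAME_RE.search(message):
            continue  # stack frame, not a message
        errors.append((match.group("ts"), message))
    return errors


# Four lines copied from the log above.
sample_log = """\
2021-08-16T08:17:29.123Z| vmx| | I005: VMIOP: Initializing plugin vmiop-display
2021-08-16T08:17:29.125Z| vmx| | E002: vmiop_log: NVOS status 0x17
2021-08-16T08:17:29.126Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv005327vgpu+0x35) [0x58f8ffb615]
2021-08-16T08:17:29.127Z| vmx| | E002: VMIOP: Plugin vmiop-display initialization failed: 1
"""

found = vmiop_errors(sample_log)
for timestamp, message in found:
    print(timestamp, message)
```

Run against the full log, this would leave the "NVOS status 0x17" line, the failed assertion, the frame count line, and the initialization failures at the end, which are the lines worth quoting when asking for help.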

Do you have vCenter also on 7.0.2?

Regards Simon

Yes, vCenter is also on 7.0.2, the newest version.

How much system memory did you assign?

The server has 256 GB of memory. I assigned 8 GB to the VM.

Please try increasing the VM hardware version. I’m not sure whether version 11 will work with vGPU 13.

I tried a VM with the newest hardware version, 19, but no luck.

Hi Simon,

We also borrowed a V100 for comparison. After replacing the A40 with the V100 in the same server, the VM started successfully with no errors. I really have no clue now. Please help, thanks!

Which server model are you using for testing? It sounds like the A40 is not working properly with the hardware. Ampere GPUs require SR-IOV to be enabled in the BIOS.

Thank you so much, Simon! After enabling SR-IOV in the BIOS of our R740 server, the VM started successfully and works fine now. I really appreciate your help!