Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_a16-4q'

Hello

I am getting an error while trying to start a VM in a vCenter 7.0.3 environment (build 21477706).
NVIDIA driver on the ESXi host: 535.104.06
NVIDIA driver in the guest VDI: 537.13
SR-IOV is enabled
GPUs are set to Shared Direct
ECC has been disabled

2024-01-16T10:23:14.325Z In(05) vmx - VMIOP: config /usr/share/nvidia/vgx/nvidia_a16-4q.conf
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: Assertion Failed at 0xc3eacc96:143
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: 17 frames returned by backtrace
2024-01-16T10:23:15.547Z Er(02) vmx - vmiop_log: (0x0): Initialization: Failed to alloc kernel host vgpu device handle error 1
2024-01-16T10:23:15.547Z In(05) vmx - [msg.vmx.plugin.vmiop.vgpu.failed] Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_a16-4q'.
2024-01-16T10:23:15.547Z In(05) vmx - Module 'DevicePowerOn' power on failed.

Do you have any idea how to fix this issue?
Thanks

Stéphane

Is the GPU recognized properly on the host?
Can you run nvidia-smi on the host?

Yes, all device IDs are recognized on each host under Graphics Devices; the nvidia-smi output is below.
The VM size is the 4Q profile, so each host should be able to run 4 vGPUs per GPU * 4 GPUs per board * 2 A16 boards = 32 VDIs (equivalently, 2 boards * 64 GB / 4 GB per 4Q vGPU).
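
As a rough sanity check of that sizing, here is a minimal Python sketch of the arithmetic. The figures are assumptions taken from this thread (2 boards per host, 4 GPUs per A16 board, 16 GB per GPU, 4 GB of framebuffer per 4Q vGPU), and the function name is made up for illustration.

```python
# Assumed figures from this thread: 2 A16 boards per host, 4 physical GPUs
# per board, 16 GB of framebuffer per GPU, 4 GB per nvidia_a16-4q vGPU.
BOARDS_PER_HOST = 2
GPUS_PER_BOARD = 4
FB_PER_GPU_GB = 16
PROFILE_FB_GB = 4


def vgpus_per_host(boards, gpus_per_board, fb_per_gpu_gb, profile_fb_gb):
    """Return how many vGPUs of a single profile size fit on one host."""
    # A vGPU cannot span physical GPUs, so count whole vGPUs per GPU first.
    per_gpu = fb_per_gpu_gb // profile_fb_gb
    return boards * gpus_per_board * per_gpu


capacity = vgpus_per_host(BOARDS_PER_HOST, GPUS_PER_BOARD, FB_PER_GPU_GB, PROFILE_FB_GB)
print(capacity)  # 32
```

This only confirms the theoretical maximum; it says nothing about whether a given GPU is actually able to accept another vGPU at power-on time.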

After ESXi is rebooted, things work better for a while.
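
Since the nvidia-smi -q output below is long, a small script can make it easier to compare the GPUs. The following is a minimal sketch, assuming the report has been saved as text; it pulls out the "Reset Required" field and the free framebuffer for each GPU. The function name and the short sample are only for illustration, and whether those two fields are related to the power-on failure is an assumption, not something confirmed here.

```python
import re

# Short excerpt in the same layout as the nvidia-smi -q report in this post.
SAMPLE = """\
GPU 00000000:1B:00.0
    GPU Reset Status
        Reset Required : No
    FB Memory Usage
        Total : 16380 MiB
        Free : 498 MiB
    BAR1 Memory Usage
        Free : 16383 MiB
GPU 00000000:1D:00.0
    GPU Reset Status
        Reset Required : Yes
    FB Memory Usage
        Total : 16380 MiB
        Free : 8307 MiB
"""

GPU_HEADER = re.compile(r"GPU ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$")


def summarize_gpus(report):
    """Return one dict per GPU: bus id, reset-required flag, free FB in MiB."""
    gpus = []
    current = None
    section = None
    for raw in report.splitlines():
        line = raw.strip()
        header = GPU_HEADER.match(line)
        if header:
            current = {"bus_id": header.group(1),
                       "reset_required": None,
                       "fb_free_mib": None}
            gpus.append(current)
            section = None
            continue
        if current is None or not line:
            continue  # skip the report preamble and blank lines
        if ":" not in line:
            section = line  # a line without a colon is a section heading
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Reset Required":
            current["reset_required"] = (value == "Yes")
        elif section == "FB Memory Usage" and key == "Free":
            current["fb_free_mib"] = int(value.split()[0])
    return gpus


for gpu in summarize_gpus(SAMPLE):
    print(gpu["bus_id"], gpu["reset_required"], gpu["fb_free_mib"])
```

On the host, the full report could be fed to `summarize_gpus` instead of `SAMPLE`, for example after redirecting the command output to a file.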

==============NVSMI LOG==============

Timestamp : Thu Jan 11 07:19:11 2024
Driver Version : 535.104.06
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported

Attached GPUs : 8
GPU 00000000:1B:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1320323003512
GPU UUID : GPU-d70940bb-8088-f4da-c3d1-6378f6e58f28
Minor Number : 0
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0x1900
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1B
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:1B:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 15616 MiB
Free : 498 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 32 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 16.31 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 668.750 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3696368
Type : C+G
Name : vdsdc107839w931
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3969565
Type : C+G
Name : vdsdc107839w936
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3976782
Type : C+G
Name : vdsdc107839w933
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 4005826
Type : C+G
Name : vdsdc107839w953
Used GPU Memory : 3904 MiB

GPU 00000000:1D:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1320323003512
GPU UUID : GPU-2cda91f4-cf80-eb70-61d4-d9eb66eb1c70
Minor Number : 1
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0x1900
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : Yes
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1D
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:1D:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : System is not in ready state
Performance State : P8
Clocks Event Reasons : System is not in ready state
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 7808 MiB
Free : 8307 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 34 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 16.21 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : System is not in ready state
SM : System is not in ready state
Memory : System is not in ready state
Video : System is not in ready state
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2661499
Type : C+G
Name : vdsdc107839w941
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3297142
Type : C+G
Name : vdsdc107839w962
Used GPU Memory : 3904 MiB

GPU 00000000:1F:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1320323003512
GPU UUID : GPU-07c88ade-b670-1c48-964f-0ba2f117563a
Minor Number : 2
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0x1900
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1F
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:1F:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 7808 MiB
Free : 8307 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 30 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 16.54 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 681.250 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2662899
Type : C+G
Name : vdsdc107839w947
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3293058
Type : C+G
Name : vdsdc107839w961
Used GPU Memory : 3904 MiB

GPU 00000000:21:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1320323003512
GPU UUID : GPU-d20b2f2a-bd78-d9a5-afd9-09ba91bca8da
Minor Number : 3
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0x1900
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x21
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:21:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 15616 MiB
Free : 498 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 1
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 28 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 16.09 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 675.000 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2663169
Type : C+G
Name : vdsdc107839w949
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2664117
Type : C+G
Name : vdsdc107839w908
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3870398
Type : C+G
Name : vdsdc107839w993
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 4295906
Type : C+G
Name : vdsdc107839w932
Used GPU Memory : 3904 MiB

GPU 00000000:CE:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324222010159
GPU UUID : GPU-ac99f846-cbc7-2b80-ffc0-06d0954abbc3
Minor Number : 4
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0xcc00
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xCE
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:CE:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 3904 MiB
Free : 12211 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 30 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 15.73 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 675.000 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3304656
Type : C+G
Name : vdsdc107839w963
Used GPU Memory : 3904 MiB

GPU 00000000:D0:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324222010159
GPU UUID : GPU-071cff69-3d70-8329-3a8f-9949da151303
Minor Number : 5
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0xcc00
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xD0
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:D0:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 0 MiB
Free : 16115 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 31 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 15.65 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 675.000 mV
Fabric
State : N/A
Status : N/A
Processes : None

GPU 00000000:D2:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324222010159
GPU UUID : GPU-59775f11-d54a-a173-5a2d-7a3fe7d48b1e
Minor Number : 6
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0xcc00
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xD2
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:D2:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 0 MiB
Free : 16115 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 28 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 15.16 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 675.000 mV
Fabric
State : N/A
Status : N/A
Processes : None

GPU 00000000:D4:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324222010159
GPU UUID : GPU-8c25ab27-33de-eaca-5794-9089173ebb2b
Minor Number : 7
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
Board ID : 0xcc00
Board Part Number : 900-2G171-0100-130
GPU Part Number : 25B6-890-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xD4
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:D4:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 0 MiB
Free : 16115 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 27 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 15.35 W
Current Power Limit : 62.50 W
Requested Power Limit : 62.50 W
Default Power Limit : 62.50 W
Min Power Limit : 48.75 W
Max Power Limit : 62.50 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 675.000 mV
Fabric
State : N/A
Status : N/A
Processes : None

Hello

Do you need any additional details or checks?
Thanks

Which OEM hardware are you using?
Whenever I have seen this issue in the past, it was related to SR-IOV. You already mentioned that it is enabled, but I would still recommend asking the OEM for the right BIOS settings, as there is often more to enable in the vendor BIOS to make it work properly.

Dell EMC VxRail series. Full logs are below.

I also found this, which we may have missed:

When you enable “SR-IOV Global” in the BIOS on Dell servers, the setting is not pushed to the NIC, so you need to enable SR-IOV on the NIC as well.

Boot to BIOS Setup:

  • Go to System BIOS >> Device Settings >> Ethernet Converged Network Adapter X710 >> Device Level Configuration >> Virtualization Mode = SR-IOV.

  • Save settings and reboot.

2024-01-16T10:23:14.325Z In(05) vmx - VMIOP: config /usr/share/nvidia/vgx/nvidia_a16-4q.conf

2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: Assertion Failed at 0xc3eacc96:143
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: 17 frames returned by backtrace
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv009078vgpu+0x35) [0x78c3f0b3d5]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x691a8) [0x78c3eb01a8]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x65c96) [0x78c3eacc96]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x7afa7) [0x78c3ec1fa7]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x85a6f) [0x78c3ecca6f]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x888fb) [0x78c3ecf8fb]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libvmx-vmiop.so(+0x9234) [0x78c3a3e234]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x3d0974) [0x787ba0c974]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x2e97e4) [0x787b9257e4]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x2e9334) [0x787b925334]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x2ea4ab) [0x787b9264ab]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x2f507b) [0x787b93107b]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x265d85) [0x787b8a1d85]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x2665b2) [0x787b8a25b2]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x25a411) [0x787b896411]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /lib64/libc.so.6(__libc_start_main+0xed) [0x78bed8cd5d]
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /bin/vmx(+0x25ade5) [0x787b896de5]
2024-01-16T10:23:15.547Z Er(02) vmx - vmiop_log: (0x0): Initialization: Failed to alloc kernel host vgpu device handle error 1
2024-01-16T10:23:15.547Z Er(02) vmx - vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (unable to setup host connection state)
2024-01-16T10:23:15.547Z Er(02) vmx - vmiop_log: (0x0): Initialization: init_device_instance failed error 1
2024-01-16T10:23:15.547Z Er(02) vmx - vmiop_log: display_init failed for inst: 0
2024-01-16T10:23:15.547Z Er(02) vmx - VMIOP: Plugin vmiop-display initialization failed: 1
2024-01-16T10:23:15.547Z In(05) vmx - [msg.vmx.plugin.vmiop.vgpu.failed] Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_a16-4q'.
2024-01-16T10:23:15.547Z In(05) vmx - Module 'DevicePowerOn' power on failed.
2024-01-16T10:23:15.547Z In(05) vmx - VMX_PowerOn: ModuleTable_PowerOn = 0
2024-01-16T10:23:15.547Z In(05) vmx - Device Interface (pciPassthru0) powering off.

2024-01-16T10:23:16.748Z In(05)+ vmx - Power on failure messages: Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_a16-4q'.
2024-01-16T10:23:16.748Z In(05)+ vmx - Module 'DevicePowerOn' power on failed.
2024-01-16T10:23:16.748Z In(05)+ vmx - Failed to start the virtual machine.
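
For long vmware.log files it can help to drop the backtrace frames and keep only the error-level lines. The following is a minimal Python sketch, assuming the line layout shown in the excerpt above (timestamp, a level such as `Er(02)`, then `vmx - `). The function name `vgpu_errors` and both regular expressions are illustrative assumptions, not part of any VMware or NVIDIA tool.

```python
import re

# Assumed layout of a vmware.log line, based on the excerpt above:
#   <timestamp> <level>(<nn>)[+] vmx - <message>
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\w+)\(\d+\)\+?\s+vmx - (?P<msg>.*)$")

# Backtrace frames end with an address in brackets, e.g. "[0x78c3eb01a8]".
FRAME_RE = re.compile(r"\[0x[0-9a-f]+\]$")


def vgpu_errors(log_text):
    """Return (timestamp, message) pairs for error-level lines, skipping backtrace frames."""
    errors = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") != "Er":
            continue  # not a log line, or not an error-level entry
        message = match.group("msg")
        if FRAME_RE.search(message):
            continue  # a stack frame, not a diagnostic message
        errors.append((match.group("ts"), message))
    return errors


# Sample lines copied from the log excerpt above.
SAMPLE = """\
2024-01-16T10:23:14.325Z In(05) vmx - VMIOP: config /usr/share/nvidia/vgx/nvidia_a16-4q.conf
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: Assertion Failed at 0xc3eacc96:143
2024-01-16T10:23:15.546Z Er(02) vmx - vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x691a8) [0x78c3eb01a8]
2024-01-16T10:23:15.547Z Er(02) vmx - vmiop_log: (0x0): Initialization: Failed to alloc kernel host vgpu device handle error 1
2024-01-16T10:23:15.547Z In(05) vmx - Module 'DevicePowerOn' power on failed.
"""

for timestamp, message in vgpu_errors(SAMPLE):
    print(timestamp, message)
```

Filtering this way leaves the assertion and the "Failed to alloc kernel host vgpu device handle" message, which are the lines worth quoting in a support case.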

This really points to an issue with SR-IOV.
Make sure that both of the following are enabled in the BIOS:

  • VT-d/IOMMU
  • SR-IOV

Hmm, unfortunately I don’t have a Dell machine to verify the settings. It looks good so far, I would say, but I don’t know whether anything else needs to be enabled in the Dell BIOS.

Hello

Hope that you are doing well
I just noticed an error here. I guess I should look into Xid errors?
GPU 00000000:D4:00.0 shows ERR:
| 7 NVIDIA A16 | 00000000:D4:00.0 | 0% |
| Unk… |

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06 Driver Version: 535.104.06 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A16 On | 00000000:1B:00.0 Off | 0 |
| 0% 33C P8 16W / 62W | 7296MiB / 15356MiB | 11% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A16 On | 00000000:1D:00.0 Off | 0 |
| 0% 45C P0 33W / 62W | 7296MiB / 15356MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A16 On | 00000000:1F:00.0 Off | 0 |
| 0% 28C P8 16W / 62W | 3648MiB / 15356MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A16 On | 00000000:21:00.0 Off | 0 |
| 0% 26C P8 15W / 62W | 3648MiB / 15356MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A16 On | 00000000:CE:00.0 Off | 0 |
| 0% 33C P8 16W / 62W | 3648MiB / 15356MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A16 On | 00000000:D0:00.0 Off | 0 |
| 0% 39C P0 32W / 62W | 7296MiB / 15356MiB | 11% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A16 On | 00000000:D2:00.0 Off | 0 |
| 0% 28C P8 16W / 62W | 3648MiB / 15356MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A16 On | 00000000:D4:00.0 Off | 0 |
|ERR! 26C P8 15W / 62W | 0MiB / 15356MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

[/vmfs/volumes//66517b65-8266-8b93-2eff-***] nvidia-smi vgpu
Fri Feb 9 10:12:32 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06 Driver Version: 535.104.06 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA A16 | 00000000:1B:00.0 | 0% |
| 3252187720 NVIDIA A16-4Q | 1991… vdsdc107839w048 | 0% |
| 3252188023 NVIDIA A16-4Q | 1992… vdsdc107839w121 | 0% |
+---------------------------------+------------------------------+------------+
| 1 NVIDIA A16 | 00000000:1D:00.0 | 65% |
| 3252115905 NVIDIA A16-4Q | 1959… vdsdc107839w054 | 37% |
| 3252116661 NVIDIA A16-4Q | 1959… vdsdc107839w140 | 0% |
+---------------------------------+------------------------------+------------+
| 2 NVIDIA A16 | 00000000:1F:00.0 | 0% |
| 3252116634 NVIDIA A16-4Q | 1959… vdsdc107839w136 | 0% |
+---------------------------------+------------------------------+------------+
| 3 NVIDIA A16 | 00000000:21:00.0 | 0% |
| 3252171942 NVIDIA A16-4Q | 1980… vdsdc107839w080 | 0% |
+---------------------------------+------------------------------+------------+
| 4 NVIDIA A16 | 00000000:CE:00.0 | 0% |
| 3252177090 NVIDIA A16-4Q | 1983… vdsdc107839w075 | 0% |
+---------------------------------+------------------------------+------------+
| 5 NVIDIA A16 | 00000000:D0:00.0 | 0% |
| 3252116193 NVIDIA A16-4Q | 1959… vdsdc107839w086 | 0% |
| 3252117006 NVIDIA A16-4Q | 1959… vdsdc107839w013 | 0% |
+---------------------------------+------------------------------+------------+
| 6 NVIDIA A16 | 00000000:D2:00.0 | 4% |
| 3252192192 NVIDIA A16-4Q | 1994… vdsdc107839w116 | 12% |
+---------------------------------+------------------------------+------------+
| 7 NVIDIA A16 | 00000000:D4:00.0 | 0% |
| Unk… |
+---------------------------------+------------------------------+------------+

For GPU 00000000:D4:00.0, the clocks are reported as "System is not in ready state":

Clocks
    Graphics                          : System is not in ready state
    SM                                : System is not in ready state
    Memory                            : System is not in ready state
    Video                             : System is not in ready state

We also see the same issue on an ESXi host where everything else looks good…
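
A failed GPU shows `ERR!` in the fan column of the plain `nvidia-smi` table, as in the GPU 7 entry earlier in this post. Below is a rough Python sketch for picking such entries out of pasted output. It assumes the multi-row-per-GPU layout shown above, and `gpus_in_error` is a hypothetical helper name, not an NVIDIA tool.

```python
import re

# First row of a GPU entry, e.g. "| 7 NVIDIA A16 On | 00000000:D4:00.0 Off | 0 |".
# Captures the GPU index and the PCI bus ID from the second column.
GPU_ROW = re.compile(
    r"^\|\s*(\d+)\s+.*?\|\s*([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.\d)\s"
)


def gpus_in_error(table_text):
    """Return (index, bus_id) for every GPU whose following status row contains 'ERR!'."""
    failed = []
    current = None
    for line in table_text.splitlines():
        match = GPU_ROW.match(line)
        if match:
            current = (int(match.group(1)), match.group(2))
        elif current and "ERR!" in line:
            failed.append(current)
            current = None  # report each GPU at most once
    return failed


# Two entries copied from the table earlier in this post.
SAMPLE = """\
| 6 NVIDIA A16 On | 00000000:D2:00.0 Off | 0 |
| 0% 28C P8 16W / 62W | 3648MiB / 15356MiB | 0% Default |
| | | N/A |
| 7 NVIDIA A16 On | 00000000:D4:00.0 Off | 0 |
|ERR! 26C P8 15W / 62W | 0MiB / 15356MiB | 0% Default |
| | | N/A |
"""

for index, bus_id in gpus_in_error(SAMPLE):
    print(f"GPU {index} at {bus_id} reports ERR!")
```

The bus ID returned here is the one to search for in the vmkernel log when looking for matching NVRM messages.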

Dear All

Hope that you are doing well
Making progress: Xid error 62 is showing up, together with Xid 45.

This time the GPU at 1D:00.0 is in error:

| 1 NVIDIA A16 On | 00000000:1D:00.0 Off | Off |
|ERR! 47C P0 33W / 62W | 7808MiB / 16380MiB | 0% Default |

Captured Xid errors:

These look like driver errors, codes 62 and 45 (internal micro-controller halt, and the code seen when running multiple CUDA applications and hitting a DBE). Could this be related to the memory clock?

2024-02-13T07:15:24.846Z cpu41:2177573)NVRM: GPU Board Serial Number: 13240*****
2024-02-13T07:15:24.846Z cpu41:2177573)NVRM: Xid (PCI:0000:1d:00): 62, pid=‘’, name=, 0000(0000) 00000000 00000000

2024-02-13T07:15:24.861Z cpu41:2177573)NVRM: Xid (PCI:0000:1d:00): 45, pid=2350126, name=, Ch 00000420
2024-02-13T07:15:24.862Z cpu41:2177573)NVRM: Xid (PCI:0000:1d:00): 45, pid=2350126, name=, Ch 00000421
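
To see which Xid codes hit which GPU, the NVRM lines can be grouped by PCI address. This is a minimal Python sketch, assuming the vmkernel log format shown above. `xids_by_device` is a hypothetical helper, and the short notes cover only the two codes mentioned in this post; NVIDIA's Xid documentation has the authoritative wording.

```python
import re
from collections import defaultdict

# Matches e.g. "NVRM: Xid (PCI:0000:1d:00): 62, ..." and captures address and code.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+),")

# Paraphrased notes for the two codes discussed in this thread (an assumption,
# not a complete table).
XID_NOTES = {
    45: "preemptive cleanup after a previous error",
    62: "internal micro-controller halt",
}


def xids_by_device(log_text):
    """Map each PCI address to the list of Xid codes logged for it, in log order."""
    found = defaultdict(list)
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            found[match.group(1)].append(int(match.group(2)))
    return dict(found)


# Lines copied from the excerpt above (quotes normalised to ASCII).
SAMPLE = """\
2024-02-13T07:15:24.846Z cpu41:2177573)NVRM: GPU Board Serial Number: 13240*****
2024-02-13T07:15:24.846Z cpu41:2177573)NVRM: Xid (PCI:0000:1d:00): 62, pid='', name=, 0000(0000) 00000000 00000000
2024-02-13T07:15:24.861Z cpu41:2177573)NVRM: Xid (PCI:0000:1d:00): 45, pid=2350126, name=, Ch 00000420
2024-02-13T07:15:24.862Z cpu41:2177573)NVRM: Xid (PCI:0000:1d:00): 45, pid=2350126, name=, Ch 00000421
"""

for pci, codes in xids_by_device(SAMPLE).items():
    for code in sorted(set(codes)):
        note = XID_NOTES.get(code, "see NVIDIA Xid documentation")
        print(f"{pci}: Xid {code} x{codes.count(code)} ({note})")
```

Keeping the codes in log order makes it easier to see that the 62 precedes the 45 entries here.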

[root@:~] nvidia-smi
Tue Feb 13 09:28:55 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06 Driver Version: 535.104.06 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A16 On | 00000000:1B:00.0 Off | Off |
| 0% 34C P8 17W / 62W | 11712MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A16 On | 00000000:1D:00.0 Off | Off |
|ERR! 47C P0 33W / 62W | 7808MiB / 16380MiB | 0% Default |

Has this been fixed in the meantime?
There is currently a bug in the latest Dell BIOS that prevents VMs from booting. Reverting to an older BIOS is the only workaround.

Hi Simon,
I can confirm that yesterday I encountered the same problem on a Dell R7525 (AMD socket and A40 cards) after a BIOS update.
The latest BIOS available (Dell Server PowerEdge BIOS R6525/R7525 Version 2.14.1) causes the error.
After some testing, we rolled back to an older version (in our case Version 2.12.4) and the VMs powered on correctly.
Thanks