A6000 with vGPU 13.0 and ESXi 7.0u2: Failed to start vGPU instance

Hi Team,

I’m trying to use an NVIDIA RTX A6000 with VMware ESXi 7.0u2. I’ve run into an issue: a VM with a vGPU profile cannot start. Can anyone help, please? Thanks!

Error when starting a VM with a vGPU profile

Key: haTask-4-vim.VirtualMachine.powerOn-191
Description: Power On this virtual machine
State: Failed - Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_rtxa6000-4q'.
Errors:  
 * Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_rtxa6000-4q'.
 * Module 'DevicePowerOn' power on failed.
 * Failed to start the virtual machine.

Environment

Server: Supermicro SYS-7049-GP-TRT
ESXi: 7.0u2, build 17867351
vCenter: 7.0.2, build 18356314

nvidia-smi output

Timestamp                                 : Fri Aug 27 14:22:13 2021
Driver Version                            : 470.63
CUDA Version                              : Not Found
Attached GPUs                             : 1
GPU 00000000:18:00.0
    Product Name                          : NVIDIA RTX A6000
    Product Brand                         : NVIDIA RTX
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : xxxxxxxxx
    GPU UUID                              : GPU-78fea16a-3767-9d77-64db-f6424ee3a417
    Minor Number                          : 0
    VBIOS Version                         : 94.02.5C.00.02
    MultiGPU Board                        : No
    Board ID                              : 0x1800
    GPU Part Number                       : 900-5G133-2200-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G133.0500.00.05
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x18
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x223010DE
        Bus Id                            : 00000000:18:00.0
        Sub System Id                     : 0x145910DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 48685 MiB
        Used                              : 0 MiB
        Free                              : 48685 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 26 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 40.72 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1800 MHz
        Memory                            : 8001 MHz
    Default Applications Clocks
        Graphics                          : 1800 MHz
        Memory                            : 8001 MHz
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 8001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 756.250 mV
    Processes                             : None

vmware.log file

2021-08-27T14:16:46.992Z| vmx| | I005: VMIOP: config /usr/share/nvidia/vgx/nvidia_rtxa6000-4q.conf
2021-08-27T14:16:46.994Z| vmx| | I005: VMIOP: Using VER_4_0 symbol /usr/lib64/vmware/plugin/libnvidia-vgx.so:vmiop_display_vmiop_plugin
2021-08-27T14:16:47.013Z| vmx| | I005: VMIOP: Registered device 0000:18:00.0
2021-08-27T14:16:47.022Z| vmx| | A000: ConfigDB: Setting pciPassthru0.pgpu = "223014590606060606060600000002"
2021-08-27T14:16:47.022Z| vmx| | I005: VMIOP: Enabling checkpoint support
2021-08-27T14:16:47.022Z| vmx| | I005: VMIOP: Initializing plugin vmiop-display
2021-08-27T14:16:47.023Z| vmx| | E002: vmiop_log: NVOS status 0x17
2021-08-27T14:16:47.023Z| vmx| | E002: vmiop_log: Assertion Failed at 0xc22f44b3:97
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: 16 frames returned by backtrace
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(_nv005327vgpu+0x35) [0x98c233c615]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x7f6f8) [0x98c22f86f8]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x7b4b3) [0x98c22f44b3]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x99b97) [0x98c2312b97]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libnvidia-vgx.so(+0x9caeb) [0x98c2315aeb]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /usr/lib64/vmware/plugin/libvmx-vmiop.so(+0x91f4) [0x98c1e701f4]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x3adf98) [0x987a11df98]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dc924) [0x987a04c924]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dc45c) [0x987a04c45c]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2dd557) [0x987a04d557]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x2e82bb) [0x987a0582bb]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x25a0c5) [0x9879fca0c5]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x25a8e2) [0x9879fca8e2]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x24e741) [0x9879fbe741]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /lib64/libc.so.6(__libc_start_main+0xed) [0x98bd420b2d]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: /bin/vmx(+0x24f115) [0x9879fbf115]
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: (0x0): Initialization: Failed to alloc host vgpu device handle error 1
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (unable to setup host connection state)
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: display_init failed for inst: 0
2021-08-27T14:16:47.024Z| vmx| | E002: VMIOP: Plugin vmiop-display initialization failed: 1
2021-08-27T14:16:47.024Z| vmx| | I005: [msg.vmx.plugin.vmiop.vgpu.failed] Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_rtxa6000-4q'.
2021-08-27T14:16:47.024Z| vmx| | I005: Module 'DevicePowerOn' power on failed.
2021-08-27T14:16:47.024Z| vmx| | I005: VMX_PowerOn: ModuleTable_PowerOn = 0
2021-08-27T14:16:47.024Z| vmx| | I005: Device Interface (pciPassthru0) powering off.
2021-08-27T14:16:47.058Z| vmx| | I005: DeviceIfPowerOff: indicating asyncIOThread to exit.
2021-08-27T14:16:47.059Z| svga| | I005: SVGA thread is exiting the main loop
2021-08-27T14:16:47.059Z| vmx| | I005: Destroying virtual dev for scsi0:0 vscsi=8209
2021-08-27T14:16:47.059Z| vmx| | I005: VMMon_VSCSIStopVports: No such target on adapter
2021-08-27T14:16:47.059Z| vmx| | I005: SVMotion_PowerOff: Not running Storage vMotion. Nothing to do
2021-08-27T14:16:47.059Z| vmx| | I005: MKS/SVGA threads are stopped
2021-08-27T14:16:47.059Z| mks| | I005: MKS-RenderMain: Stopped MKSBasicOps
2021-08-27T14:16:47.060Z| mks| | I005: MKS PowerOff
2021-08-27T14:16:47.060Z| svga| | I005: SVGA thread is exiting
2021-08-27T14:16:47.060Z| mks| | I005: MKS thread is exiting
2021-08-27T14:16:47.060Z| vmx| | W003: 
2021-08-27T14:16:47.060Z| vmx| | I005: scsi0:0: numIOs = 0 numMergedIOs = 0 numSplitIOs = 0 ( 0.0%)
2021-08-27T14:16:47.060Z| vmx| | I005: Closing disk 'scsi0:0'
2021-08-27T14:16:47.060Z| vmx| | I005: DISKLIB-VMFS  : "/vmfs/volumes/61284f28-b5f5a1e8-ea19-ac1f6ba1bf98/vgpu-01/vgpu-01-flat.vmdk" : closed.
2021-08-27T14:16:47.061Z| vmx| | I005: Vix: [mainDispatch.c:1164]: VMAutomationPowerOff: Powering off.
2021-08-27T14:16:47.062Z| vmx| | W003: /vmfs/volumes/61284f28-b5f5a1e8-ea19-ac1f6ba1bf98/vgpu-01/vgpu-01.vmx: Cannot remove symlink /var/run/vmware/0/517531320_2101621/configFile: No such file or directory
2021-08-27T14:16:47.062Z| vmx| | I005: WORKER: asyncOps=1 maxActiveOps=1 maxPending=1 maxCompleted=0
2021-08-27T14:16:47.122Z| vmx| | I005: Vix: [mainDispatch.c:4205]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2021-08-27T14:16:47.122Z| vmx| | I005: 
2021-08-27T14:16:47.122Z| vmx| | I005+ Power on failure messages: Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_rtxa6000-4q'.
2021-08-27T14:16:47.122Z| vmx| | I005+ Module 'DevicePowerOn' power on failed.
2021-08-27T14:16:47.122Z| vmx| | I005+ Failed to start the virtual machine.
2021-08-27T14:16:47.122Z| vmx| | I005+ 
2021-08-27T14:16:47.122Z| vmx| | I005: Vix: [mainDispatch.c:4205]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0
2021-08-27T14:16:47.122Z| vmx| | I005: Transitioned vmx/execState/val to poweredOff
2021-08-27T14:16:47.122Z| vmx| | I005: Vix: [mainDispatch.c:4205]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=0 additionalError=0
2021-08-27T14:16:47.122Z| vmx| | I005: Vix: [mainDispatch.c:4243]: Error VIX_E_FAIL in VMAutomation_ReportPowerOpFinished(): Unknown error
2021-08-27T14:16:47.122Z| vmx| | I005: Vix: [mainDispatch.c:4205]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0
2021-08-27T14:16:47.122Z| vmx| | I005: Transitioned vmx/execState/val to poweredOff
2021-08-27T14:16:47.127Z| vmx| | I005: Vix: [mainDispatch.c:815]: VMAutomation_LateShutdown()
2021-08-27T14:16:47.127Z| vmx| | I005: Vix: [mainDispatch.c:770]: VMAutomationCloseListenerSocket. Closing listener socket.
2021-08-27T14:16:47.128Z| vmx| | I005: Flushing VMX VMDB connections
2021-08-27T14:16:47.128Z| vmx| | I005: VigorTransport_ServerCloseClient: Closing transport 987B73F5F0 (err = 0)
2021-08-27T14:16:47.128Z| vmx| | I005: VigorTransport_ServerDestroy: server destroyed.
2021-08-27T14:16:47.128Z| vmx| | I005: VMX exit (0).
2021-08-27T14:16:47.129Z| vmx| | I005: OBJLIB-LIB: ObjLib cleanup done.
2021-08-27T14:16:47.129Z| vmx| | I005: AIOMGR-S : stat o=3 r=16 w=0 i=0 br=98416 bw=0
2021-08-27T14:16:47.129Z| vmx| | W003: VMX has left the building: 0.
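The useful part of a vmware.log like this is the handful of vmiop_log messages buried between the backtrace frames. Below is a minimal Python sketch that filters them out, assuming the `| E002: vmiop_log:` line format seen in this post (other ESXi or vGPU releases may format it differently); the script and function names are hypothetical.

```python
import re
import sys

# Message part of lines such as
#   2021-08-27T14:16:47.024Z| vmx| | E002: vmiop_log: (0x0): Initialization: ...
# The "E002: vmiop_log:" prefix is copied from the log in this thread; other
# ESXi / vGPU releases may format it differently.
VMIOP_ERROR = re.compile(r"\|\s*E\d+:\s*vmiop_log:\s*(?P<msg>.*)$")

# Backtrace frames look like "/usr/lib64/.../libnvidia-vgx.so(+0x7f6f8) [0x98c22f86f8]".
BACKTRACE_FRAME = re.compile(r"^\S+\(.*\)\s+\[0x[0-9a-fA-F]+\]$")


def vmiop_errors(lines):
    """Return vmiop_log error messages, skipping the backtrace noise."""
    messages = []
    for line in lines:
        match = VMIOP_ERROR.search(line.rstrip("\n"))
        if not match:
            continue
        msg = match.group("msg").strip()
        if BACKTRACE_FRAME.match(msg) or msg.endswith("frames returned by backtrace"):
            continue
        messages.append(msg)
    return messages


if __name__ == "__main__":
    # Usage: python3 vmiop_errors.py vmware.log [other.log ...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as log:
            for message in vmiop_errors(log):
                print(message)
```

Run against the log above, this leaves six lines, from `NVOS status 0x17` down to `display_init failed for inst: 0`, with `Failed to alloc host vgpu device handle error 1` in the middle.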

If you need more information, please let me know.
Best regards.
Kaka


Yes, I can :) You need to change the GPU mode first. The A6000 is a workstation GPU and needs to be switched to display mode off with a big BAR1 size.

You will need to register to be able to download the Display Mode Selector tool.

Regards Simon

Thank you for the information.
I succeeded in deploying the vGPU instance by switching the display mode (i.e., disabling the multi-display mode).

By the way, the “Display Mode Selector Tool” did not run on ESXi (of course, I had already installed the VIB on this host before running the tool).
So I changed the ESXi setting to pass through the GPU, created a new VM with the passed-through A6000 attached, installed the NVIDIA driver, and rebooted. After boot-up, I ran the tool, rebooted ESXi, restored the GPU setting on ESXi, and so on. This is so complicated…
Do you have a recommended way to switch the mode after vSphere is installed?

Best regards.
Kaka

Unfortunately, there is no easy way to use the tool with ESX. Keep in mind that the A6000 is designed for workstations and the A40 for servers. Customers with “special” use cases like this have to put in the additional effort with the Display Mode Selector tool.


I am having the same issue. What do you mean by “needs to be switched to display mode off and big BAR1 size”?

I have display mode disabled; is that sufficient? How do I adjust the BAR1 size?

    Display Mode                          : Disabled
    Display Active                        : Disabled

    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB

Thanks!

You need to use the display mode selector tool!

How do we verify it’s disabled after running the command?

Check BAR1 size with nvidia-smi -q. It needs to show 64GB instead of 256MB.
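The BAR1 check can be scripted against saved `nvidia-smi -q` output. A minimal sketch, assuming the text layout shown in this thread (a `Total : N MiB` line directly under `BAR1 Memory Usage`) and taking the 64 GB figure from the reply above; the file and function names are hypothetical.

```python
import re
import sys

# "Total : 256 MiB" line directly under the "BAR1 Memory Usage" heading, as in
# the `nvidia-smi -q` output posted in this thread.
TOTAL_LINE = re.compile(r"^\s*Total\s*:\s*(\d+)\s*MiB\s*$")

# 64 GB, the BAR1 size quoted above for the A6000 in displayless mode.
BIG_BAR1_MIB = 64 * 1024


def bar1_total_mib(smi_output):
    """Return BAR1 'Total' in MiB from `nvidia-smi -q` text (first GPU), or None."""
    lines = smi_output.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.strip() == "BAR1 Memory Usage":
            match = TOTAL_LINE.match(lines[i + 1])
            if match:
                return int(match.group(1))
    return None


def has_big_bar1(smi_output, required_mib=BIG_BAR1_MIB):
    """True if the reported BAR1 size is at least required_mib."""
    total = bar1_total_mib(smi_output)
    return total is not None and total >= required_mib


if __name__ == "__main__":
    # Usage: nvidia-smi -q > smi.txt ; python3 check_bar1.py smi.txt
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        print(f"{path}: BAR1 total = {bar1_total_mib(text)} MiB, "
              f"big BAR1: {has_big_bar1(text)}")
```

The 64 GB threshold is specific to the A6000: the A5000 later in this thread reports 32768 MiB and that was judged fine, so pass a different `required_mib` for other boards rather than treating 64 GB as universal.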

Hi, I used the selector tool, but there is no option to change BAR1 to 64 GB. How can I change it to 64 GB?

There are only two options: big BAR = 64 GB or medium BAR = 8 GB. Displayless mode uses the big BAR.

regards
Simon

Hello Simon,
Hi folks,

I’ve used displaymodeselector (in a RHEL 8 VM with GPU passthrough on vSphere 7u2) to change the GPU mode of an RTX A5000 to ‘disabled display’ for use with vGPU/GRID.

The tool reports no errors, but after rebooting ESXi, nvidia-smi still reports:

[root@esxi3:~] nvidia-smi -i 0 -q | grep -i display
    Display Mode                          : Enabled
    Display Active                        : Disabled

I’ve tried it several times, but the ‘display enabled’ mode still persists, so it is not possible to start any VM with a vGPU/GRID profile (c or q).
Error: "Could not initialize plugin 'libnvidia-vgx.so' for vGPU 'nvidia_rtxa5000_8q'"
The VIB is 460.73.02; I also tested 460.91 and 470.x, with no change. ECC is disabled.

Any ideas?

Thanks in advance,
Oliver

Display Mode always shows Enabled. Only the BAR1 size is relevant. Please post the BAR1 output.

Regards Simon

[root@esxi3:~] nvidia-smi -i 0 -q | grep -i BAR -C4
    FB Memory Usage
        Total                             : 24258 MiB
        Used                              : 0 MiB
        Free                              : 24258 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 1 MiB
        Free                              : 32767 MiB
    Compute Mode                          : Default

I forgot to mention in my first post:
the VMX settings (pciPassthruMMIO) were also made, and the VMs boot in EFI mode.
Thanks,
Oliver

The GPU mode looks good. Did you enable SR-IOV? Is Shared Direct configured in vCenter?
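One part of this question can be read from `nvidia-smi -q` on the host: the `GPU Virtualization Mode` block. A minimal sketch, assuming the key/value layout shown in this thread; the function names are hypothetical. Note that both nvidia-smi outputs posted here, including the failing A6000 one, already show `Host VGPU Mode : SR-IOV`, so this only confirms what the host driver reports. It does not verify the BIOS SR-IOV option or the Shared Direct setting in vCenter.

```python
import re

# Generic "Key : Value" line from `nvidia-smi -q`, e.g.
#   "        Host VGPU Mode                    : SR-IOV"
KEY_VALUE = re.compile(r"^\s*(?P<key>[^:]+?)\s*:\s*(?P<value>.*?)\s*$")

WANTED_KEYS = ("Virtualization Mode", "Host VGPU Mode")


def virtualization_state(smi_output):
    """Return the first 'Virtualization Mode' / 'Host VGPU Mode' values found."""
    state = {}
    for line in smi_output.splitlines():
        match = KEY_VALUE.match(line)
        if match and match.group("key") in WANTED_KEYS:
            state.setdefault(match.group("key"), match.group("value"))
    return state


def reports_sriov_host_vgpu(smi_output):
    """True if the host driver reports Host VGPU running in SR-IOV mode."""
    state = virtualization_state(smi_output)
    return (state.get("Virtualization Mode") == "Host VGPU"
            and state.get("Host VGPU Mode") == "SR-IOV")
```

Usage: `reports_sriov_host_vgpu(open("smi.txt").read())` on output saved with `nvidia-smi -q > smi.txt`.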

Hello Simon,
yes, SR-IOV is enabled, Above 4G Decoding too,
and also Shared Direct in vCenter.
That’s why I’m running out of ideas… ;-)
Thanks
Oliver

Can you post the nvidia-smi output from the ESX host? Is vCenter also on 7.0.2?

Hi Simon,

vCenter is the latest version and runs on one of my ESXi hosts; all ESXi hosts are on 7u2d.
I’ve also updated the host BIOS to the latest version. SR-IOV is enabled and IOMMU works (dmesg logs):

dmesg -i | grep iommu

TSC: 209076 cpu0:1)BootConfig: 711: iommuMapReservedMem = 1 (1)
2021-09-19T15:30:44.936Z cpu42:2097718)Loading module dma_mapper_iommu …
2021-09-19T15:30:44.939Z cpu42:2097718)Elf: 2060: module dma_mapper_iommu has license VMware
2021-09-19T15:30:44.939Z cpu42:2097718)Mod: 4789: Initialization of dma_mapper_iommu succeeded with module ID 2.
2021-09-19T15:30:44.939Z cpu42:2097718)dma_mapper_iommu loaded successfully.

2021-09-19T15:30:48.218Z cpu38:2097864)DMA: 1044: Protecting DMA engine ‘NVIDIADmaEngine’. Putting parent PCI device 0000:4f:00.0 (—THIS IS THE RTX-A5000 -----) in IOMMU domain 0x4308162604e0.
2021-09-19T15:30:48.218Z cpu38:2097864)DMA: 687: DMA Engine ‘NVIDIADmaEngine’ created using mapper ‘DMAIOMMU’.
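To tie this IOMMU message to the GPU that nvidia-smi reports, the two address formats have to be matched up (`00000000:4F:00.0` in nvidia-smi vs. `0000:4f:00.0` here). A small sketch, assuming the message format quoted above (it may change between ESXi builds); the function names are hypothetical. Finding the line only confirms the GPU was placed in an IOMMU domain, as Oliver reports; it does not by itself explain the plugin failure.

```python
import re

# The vmkernel DMA message quoted in this thread, e.g.
#   "... Putting parent PCI device 0000:4f:00.0 ... in IOMMU domain 0x4308162604e0."
IOMMU_LINE = re.compile(
    r"Putting parent PCI device (?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r".*?in IOMMU domain (?P<domain>0x[0-9a-fA-F]+)"
)


def normalize_bdf(address):
    """'00000000:4F:00.0' (nvidia-smi style) -> '0000:4f:00.0' (vmkernel style)."""
    domain, bus, devfn = address.lower().split(":")
    return f"{int(domain, 16):04x}:{bus}:{devfn}"


def iommu_domain_for(log_text, pci_address):
    """Return the IOMMU domain logged for pci_address, or None if absent."""
    wanted = normalize_bdf(pci_address)
    for match in IOMMU_LINE.finditer(log_text):
        if normalize_bdf(match.group("bdf")) == wanted:
            return match.group("domain")
    return None
```

Usage: `iommu_domain_for(saved_dmesg_text, "00000000:4F:00.0")`, passing the Bus Id exactly as nvidia-smi prints it.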

But the error remains when starting a VM attached to the RTX A5000 (c/q GRID profile):
“could not load plug-in libnvidia-vgx.so …”

nvidia-smi output is below

Thanks in advance, Oliver

==============NVSMI LOG==============

Timestamp                                 : Sun Sep 19 15:45:57 2021
Driver Version                            : 470.63
CUDA Version                              : Not Found

Attached GPUs                             : 1
GPU 00000000:4F:00.0
    Product Name                          : NVIDIA RTX A5000
    Product Brand                         : NVIDIA
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1321721023294
    GPU UUID                              : GPU-8e853934-c60b-b92b-ba73-fd8d421663bd
    Minor Number                          : 0
    VBIOS Version                         : 94.02.6D.00.05
    MultiGPU Board                        : No
    Board ID                              : 0x4f00
    GPU Part Number                       : 900-5G132-2500-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G132.0500.00.01
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x4F
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x223110DE
        Bus Id                            : 00000000:4F:00.0
        Sub System Id                     : 0x147E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24258 MiB
        Used                              : 0 MiB
        Free                              : 24258 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 1 MiB
        Free                              : 32767 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 34 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 90 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 24.71 W
        Power Limit                       : 230.00 W
        Default Power Limit               : 230.00 W
        Enforced Power Limit              : 230.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 230.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Default Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 8001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 668.750 mV
    Processes                             : None

Everything looks OK from the host side. I’m also running out of ideas :(

Make sure the BIOS and firmware are fully up to date.
Enable SR-IOV for the A6000 on the ESXi host (in the vSphere Client).
Set the MMIO base to 2TB and the low value to 256GB.
ASPM: Auto.
Enable 4G decoding.
Use a small amount of VM RAM, 8 or 16GB, not 32GB or more.