Hi there,
I’m testing 2x RTX Pro 6000 Blackwell cards in an ESXi host and have already run into the following problem multiple times:
- After doing heavy CUDA compute on a single VM the CUDA initialization fails at some point.
- On the host (ESXi), nvidia-smi shows ERR for one of the cards (and requires a reset).
- Performing a reboot in vSphere results in a PSOD.
Both GPUs are configured to use Direct Shared graphics in Single Size mode.
The host driver is updated to the latest 580.126.09.
Broadcom docs suggest contacting NVIDIA:
- ESXi host experiences PSOD (purple screen of death) with IOMMU Fault detected for vmgfx3/nvidia
- ESXi PSOD - IOMMU Fault detected for (<varies>/nvidia) - Reason: 0x6 (PTE not set to allow Read.)
Below is the nvidia-smi -q output for both cards. The one that crashes is the first one (id 0):
[root@esxi01:~] nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Tue Jan 20 12:34:37 2026
Driver Version : 580.126.08
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported
Attached GPUs : 2
GPU 00000000:17:00.0
Product Name : NVIDIA RTX PRO 6000 Blackwell Server Edition
Product Brand : NVIDIA
Product Architecture : Blackwell
Display Mode : Requested functionality has been deprecated
Display Attached : Yes
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Supported
Homogeneous Placements : Supported
MIG Time-Slicing : Supported
MIG Time-Slicing Mode : Disabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323825040191
GPU UUID : GPU-d131a566-137d-7ca0-9fbf-202ee2dc080e
GPU PDI : 0xfe30e000092cee5c
Minor Number : 0
VBIOS Version : 98.02.67.00.0A
MultiGPU Board : No
Board ID : 0x1700
Board Part Number : 900-2G153-0000-000
GPU Part Number : 2BB5-895-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number :
Slot Number : 0
Tray Index : 0
Host ID : 1
Peer Type : Direct Connected
Module Id : 1
GPU Fabric GUID : 0x0000000000000000
Inforom Version
Image Version : G153.0210.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : Disabled
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
vGPU Heterogeneous Mode : Disabled
GPU Recovery Action : None
GSP Firmware Version : 580.126.08
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x17
Device : 0x00
Domain : 0x0000
Device Id : 0x2BB510DE
Bus Id : 00000000:17:00.0
Sub System Id : 0x204E10DE
GPU Link Info
PCIe Generation
Max : 5
Current : 1
Device Current : 1
Device Max : 5
Host Max : N/A
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 496 KB/s
Rx Throughput : 587 KB/s
Atomic Caps Outbound : N/A
Atomic Caps Inbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P8
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Clocks Event Reasons Counters
SW Power Capping : 1441734 us
Sync Boost : 0 us
SW Thermal Slowdown : 0 us
HW Thermal Slowdown : 0 us
HW Power Braking : 0 us
Sparse Operation Mode : N/A
FB Memory Usage
Total : 97887 MiB
Reserved : 2288 MiB
Used : 0 MiB
Free : 95600 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : Disabled
Pending : Disabled
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Channel Repair Pending : No
TPC Repair Pending : No
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 512 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 27 C
GPU T.Limit Temp : 58 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 39.09 W
Instantaneous Power Draw : 39.12 W
Current Power Limit : 600.00 W
Requested Power Limit : 600.00 W
Default Power Limit : 600.00 W
Min Power Limit : 300.00 W
Max Power Limit : 600.00 W
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 180 MHz
SM : 180 MHz
Memory : 405 MHz
Video : 600 MHz
Applications Clocks
Graphics : 2430 MHz
Memory : 12481 MHz
Default Applications Clocks
Graphics : 2430 MHz
Memory : 12481 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2430 MHz
SM : 2430 MHz
Memory : 12481 MHz
Video : 2107 MHz
Max Customer Boost Clocks
Graphics : 2430 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Summary : N/A
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Incorrect Configuration : N/A
Partition Assigned : N/A
Processes : None
Capabilities
EGM : disabled
GPU 00000000:2A:00.0
Product Name : NVIDIA RTX PRO 6000 Blackwell Server Edition
Product Brand : NVIDIA
Product Architecture : Blackwell
Display Mode : Requested functionality has been deprecated
Display Attached : Yes
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Supported
Homogeneous Placements : Supported
MIG Time-Slicing : Supported
MIG Time-Slicing Mode : Disabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323825040902
GPU UUID : GPU-6e3d220b-af14-c6be-d7f2-7678427af18c
GPU PDI : 0x425458bab645bc57
Minor Number : 1
VBIOS Version : 98.02.67.00.0A
MultiGPU Board : No
Board ID : 0x2a00
Board Part Number : 900-2G153-0000-000
GPU Part Number : 2BB5-895-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number :
Slot Number : 0
Tray Index : 0
Host ID : 1
Peer Type : Direct Connected
Module Id : 1
GPU Fabric GUID : 0x0000000000000000
Inforom Version
Image Version : G153.0210.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : Disabled
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
vGPU Heterogeneous Mode : Disabled
GPU Recovery Action : None
GSP Firmware Version : 580.126.08
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x2A
Device : 0x00
Domain : 0x0000
Device Id : 0x2BB510DE
Bus Id : 00000000:2A:00.0
Sub System Id : 0x204E10DE
GPU Link Info
PCIe Generation
Max : 5
Current : 1
Device Current : 1
Device Max : 5
Host Max : N/A
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 501 KB/s
Rx Throughput : 468 KB/s
Atomic Caps Outbound : N/A
Atomic Caps Inbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P8
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Clocks Event Reasons Counters
SW Power Capping : 1398768 us
Sync Boost : 0 us
SW Thermal Slowdown : 0 us
HW Thermal Slowdown : 0 us
HW Power Braking : 0 us
Sparse Operation Mode : N/A
FB Memory Usage
Total : 97887 MiB
Reserved : 2288 MiB
Used : 0 MiB
Free : 95600 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : Disabled
Pending : Disabled
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Channel Repair Pending : No
TPC Repair Pending : No
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 512 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 28 C
GPU T.Limit Temp : 56 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 38.00 W
Instantaneous Power Draw : 38.19 W
Current Power Limit : 600.00 W
Requested Power Limit : 600.00 W
Default Power Limit : 600.00 W
Min Power Limit : 300.00 W
Max Power Limit : 600.00 W
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 180 MHz
SM : 180 MHz
Memory : 405 MHz
Video : 600 MHz
Applications Clocks
Graphics : 2430 MHz
Memory : 12481 MHz
Default Applications Clocks
Graphics : 2430 MHz
Memory : 12481 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2430 MHz
SM : 2430 MHz
Memory : 12481 MHz
Video : 2107 MHz
Max Customer Boost Clocks
Graphics : 2430 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Summary : N/A
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Incorrect Configuration : N/A
Partition Assigned : N/A
Processes : None
Capabilities
EGM : disabled
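For anyone who wants to compare the two GPU blocks above without reading them side by side, here is a minimal Python sketch. It assumes the `nvidia-smi -q` text has been saved to a file or string in the `Key : Value` layout shown above; the helper names `parse_gpus` and `diff_gpus` are made up for illustration and are not part of any NVIDIA tool. It compares the two blocks entry by entry, so it only works when both GPUs report the same fields in the same order.

```python
import re

# A per-GPU block starts with a line such as "GPU 00000000:17:00.0".
GPU_HEADER = re.compile(
    r"^GPU [0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]$"
)


def parse_gpus(text):
    """Split `nvidia-smi -q` text into {bus_id: [(section, key, value), ...]}."""
    gpus = {}
    current = None
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if GPU_HEADER.match(line):
            current = gpus.setdefault(line.split()[1], [])
            section = ""
            continue
        if current is None:
            continue  # global lines before the first GPU block
        key, sep, value = line.partition(" : ")
        if sep:
            current.append((section, key.strip(), value.strip()))
        elif line.endswith(" :"):
            # A key with an empty value, e.g. "Chassis Serial Number :".
            current.append((section, line[:-2].strip(), ""))
        else:
            # A line without a value is treated as a section heading.
            section = line
    return gpus


def diff_gpus(a, b):
    """Return (section, key, value_a, value_b) for entries that differ."""
    return [
        (sa, ka, va, vb)
        for (sa, ka, va), (sb, kb, vb) in zip(a, b)
        if (sa, ka) == (sb, kb) and va != vb
    ]


# Small excerpt taken from the output above.
sample = """\
GPU 00000000:17:00.0
Serial Number : 1323825040191
Temperature
GPU Current Temp : 27 C
GPU T.Limit Temp : 58 C
GPU 00000000:2A:00.0
Serial Number : 1323825040902
Temperature
GPU Current Temp : 28 C
GPU T.Limit Temp : 56 C
"""

gpus = parse_gpus(sample)
differences = diff_gpus(gpus["00000000:17:00.0"], gpus["00000000:2A:00.0"])
for section, key, first, second in differences:
    print(f"{section or '(top level)'} / {key}: {first} vs {second}")
```

Running the same comparison on a dump captured while one card shows ERR, against a dump from a healthy state, may make it easier to see which fields change.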


