RTX Pro 6000 Blackwell SE: IOMMU Fault detected (ESXi)

Hi there,

I’m testing 2x RTX Pro 6000 Blackwell cards in an ESXi host and have already hit the following problem multiple times:

  1. After doing heavy CUDA compute on a single VM the CUDA initialization fails at some point.
  2. On the host (ESXi), nvidia-smi shows ERR for one of the cards (and requires reset).
  3. Performing a reboot in vSphere results in a PSOD:

Both GPUs are configured to use Direct Shared graphics in Single Size mode.

The host driver is updated to the latest 580.126.09.

Broadcom docs suggest contacting NVIDIA:

Below is the nvidia-smi -q output for both cards. The one that crashes is the first one (id 0):

[root@esxi01:~] nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                              : Tue Jan 20 12:34:37 2026
Driver Version                                         : 580.126.08
CUDA Version                                           : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU                        : Supported

Attached GPUs                                          : 2
GPU 00000000:17:00.0
    Product Name                                       : NVIDIA RTX PRO 6000 Blackwell Server Edition
    Product Brand                                      : NVIDIA
    Product Architecture                               : Blackwell
    Display Mode                                       : Requested functionality has been deprecated
    Display Attached                                   : Yes
    Display Active                                     : Disabled
    Persistence Mode                                   : Enabled
    Addressing Mode                                    : N/A
    vGPU Device Capability
        Fractional Multi-vGPU                          : Supported
        Heterogeneous Time-Slice Profiles              : Supported
        Heterogeneous Time-Slice Sizes                 : Supported
        Homogeneous Placements                         : Supported
        MIG Time-Slicing                               : Supported
        MIG Time-Slicing Mode                          : Disabled
    MIG Mode
        Current                                        : Disabled
        Pending                                        : Disabled
    Accounting Mode                                    : Enabled
    Accounting Mode Buffer Size                        : 4000
    Driver Model
        Current                                        : N/A
        Pending                                        : N/A
    Serial Number                                      : 1323825040191
    GPU UUID                                           : GPU-d131a566-137d-7ca0-9fbf-202ee2dc080e
    GPU PDI                                            : 0xfe30e000092cee5c
    Minor Number                                       : 0
    VBIOS Version                                      : 98.02.67.00.0A
    MultiGPU Board                                     : No
    Board ID                                           : 0x1700
    Board Part Number                                  : 900-2G153-0000-000
    GPU Part Number                                    : 2BB5-895-A1
    FRU Part Number                                    : N/A
    Platform Info
        Chassis Serial Number                          :
        Slot Number                                    : 0
        Tray Index                                     : 0
        Host ID                                        : 1
        Peer Type                                      : Direct Connected
        Module Id                                      : 1
        GPU Fabric GUID                                : 0x0000000000000000
    Inforom Version
        Image Version                                  : G153.0210.00.02
        OEM Object                                     : 2.1
        ECC Object                                     : 7.16
        Power Management Object                        : N/A
    Inforom BBX Object Flush
        Latest Timestamp                               : N/A
        Latest Duration                                : N/A
    GPU Operation Mode
        Current                                        : N/A
        Pending                                        : N/A
    GPU C2C Mode                                       : Disabled
    GPU Virtualization Mode
        Virtualization Mode                            : Host VGPU
        Host VGPU Mode                                 : SR-IOV
        vGPU Heterogeneous Mode                        : Disabled
    GPU Recovery Action                                : None
    GSP Firmware Version                               : 580.126.08
    IBMNPU
        Relaxed Ordering Mode                          : N/A
    PCI
        Bus                                            : 0x17
        Device                                         : 0x00
        Domain                                         : 0x0000
        Device Id                                      : 0x2BB510DE
        Bus Id                                         : 00000000:17:00.0
        Sub System Id                                  : 0x204E10DE
        GPU Link Info
            PCIe Generation
                Max                                    : 5
                Current                                : 1
                Device Current                         : 1
                Device Max                             : 5
                Host Max                               : N/A
            Link Width
                Max                                    : 16x
                Current                                : 16x
        Bridge Chip
            Type                                       : N/A
            Firmware                                   : N/A
        Replays Since Reset                            : 0
        Replay Number Rollovers                        : 0
        Tx Throughput                                  : 496 KB/s
        Rx Throughput                                  : 587 KB/s
        Atomic Caps Outbound                           : N/A
        Atomic Caps Inbound                            : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
    Fan Speed                                          : N/A
    Performance State                                  : P8
    Clocks Event Reasons
        Idle                                           : Not Active
        Applications Clocks Setting                    : Not Active
        SW Power Cap                                   : Not Active
        HW Slowdown                                    : Not Active
            HW Thermal Slowdown                        : Not Active
            HW Power Brake Slowdown                    : Not Active
        Sync Boost                                     : Not Active
        SW Thermal Slowdown                            : Not Active
        Display Clock Setting                          : Not Active
    Clocks Event Reasons Counters
        SW Power Capping                               : 1441734 us
        Sync Boost                                     : 0 us
        SW Thermal Slowdown                            : 0 us
        HW Thermal Slowdown                            : 0 us
        HW Power Braking                               : 0 us
    Sparse Operation Mode                              : N/A
    FB Memory Usage
        Total                                          : 97887 MiB
        Reserved                                       : 2288 MiB
        Used                                           : 0 MiB
        Free                                           : 95600 MiB
    BAR1 Memory Usage
        Total                                          : 131072 MiB
        Used                                           : 1 MiB
        Free                                           : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                                          : 0 MiB
        Used                                           : 0 MiB
        Free                                           : 0 MiB
    Compute Mode                                       : Default
    Utilization
        GPU                                            : 0 %
        Memory                                         : 0 %
        Encoder                                        : 0 %
        Decoder                                        : 0 %
        JPEG                                           : 0 %
        OFA                                            : 0 %
    Encoder Stats
        Active Sessions                                : 0
        Average FPS                                    : 0
        Average Latency                                : 0
    FBC Stats
        Active Sessions                                : 0
        Average FPS                                    : 0
        Average Latency                                : 0
    DRAM Encryption Mode
        Current                                        : Disabled
        Pending                                        : Disabled
    ECC Mode
        Current                                        : Disabled
        Pending                                        : Disabled
    ECC Errors
        Volatile
            SRAM Correctable                           : N/A
            SRAM Uncorrectable Parity                  : N/A
            SRAM Uncorrectable SEC-DED                 : N/A
            DRAM Correctable                           : N/A
            DRAM Uncorrectable                         : N/A
        Aggregate
            SRAM Correctable                           : N/A
            SRAM Uncorrectable Parity                  : N/A
            SRAM Uncorrectable SEC-DED                 : N/A
            DRAM Correctable                           : N/A
            DRAM Uncorrectable                         : N/A
            SRAM Threshold Exceeded                    : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                                    : N/A
            SRAM SM                                    : N/A
            SRAM Microcontroller                       : N/A
            SRAM PCIE                                  : N/A
            SRAM Other                                 : N/A
        Channel Repair Pending                         : No
        TPC Repair Pending                             : No
    Retired Pages
        Single Bit ECC                                 : N/A
        Double Bit ECC                                 : N/A
        Pending Page Blacklist                         : N/A
    Remapped Rows
        Correctable Error                              : 0
        Uncorrectable Error                            : 0
        Pending                                        : No
        Remapping Failure Occurred                     : No
        Bank Remap Availability Histogram
            Max                                        : 512 bank(s)
            High                                       : 0 bank(s)
            Partial                                    : 0 bank(s)
            Low                                        : 0 bank(s)
            None                                       : 0 bank(s)
    Temperature
        GPU Current Temp                               : 27 C
        GPU T.Limit Temp                               : 58 C
        GPU Shutdown T.Limit Temp                      : -5 C
        GPU Slowdown T.Limit Temp                      : -2 C
        GPU Max Operating T.Limit Temp                 : 0 C
        GPU Target Temperature                         : N/A
        Memory Current Temp                            : N/A
        Memory Max Operating T.Limit Temp              : N/A
    GPU Power Readings
        Average Power Draw                             : 39.09 W
        Instantaneous Power Draw                       : 39.12 W
        Current Power Limit                            : 600.00 W
        Requested Power Limit                          : 600.00 W
        Default Power Limit                            : 600.00 W
        Min Power Limit                                : 300.00 W
        Max Power Limit                                : 600.00 W
    GPU Memory Power Readings
        Average Power Draw                             : N/A
        Instantaneous Power Draw                       : N/A
    Module Power Readings
        Average Power Draw                             : N/A
        Instantaneous Power Draw                       : N/A
        Current Power Limit                            : N/A
        Requested Power Limit                          : N/A
        Default Power Limit                            : N/A
        Min Power Limit                                : N/A
        Max Power Limit                                : N/A
    Power Smoothing                                    : N/A
    Workload Power Profiles
        Requested Profiles                             : N/A
        Enforced Profiles                              : N/A
    Clocks
        Graphics                                       : 180 MHz
        SM                                             : 180 MHz
        Memory                                         : 405 MHz
        Video                                          : 600 MHz
    Applications Clocks
        Graphics                                       : 2430 MHz
        Memory                                         : 12481 MHz
    Default Applications Clocks
        Graphics                                       : 2430 MHz
        Memory                                         : 12481 MHz
    Deferred Clocks
        Memory                                         : N/A
    Max Clocks
        Graphics                                       : 2430 MHz
        SM                                             : 2430 MHz
        Memory                                         : 12481 MHz
        Video                                          : 2107 MHz
    Max Customer Boost Clocks
        Graphics                                       : 2430 MHz
    Clock Policy
        Auto Boost                                     : N/A
        Auto Boost Default                             : N/A
    Fabric
        State                                          : N/A
        Status                                         : N/A
        CliqueId                                       : N/A
        ClusterUUID                                    : N/A
        Health
            Summary                                    : N/A
            Bandwidth                                  : N/A
            Route Recovery in progress                 : N/A
            Route Unhealthy                            : N/A
            Access Timeout Recovery                    : N/A
            Incorrect Configuration                    : N/A
            Partition Assigned                         : N/A
    Processes                                          : None
    Capabilities
        EGM                                            : disabled

GPU 00000000:2A:00.0
    Product Name                                       : NVIDIA RTX PRO 6000 Blackwell Server Edition
    Product Brand                                      : NVIDIA
    Product Architecture                               : Blackwell
    Display Mode                                       : Requested functionality has been deprecated
    Display Attached                                   : Yes
    Display Active                                     : Disabled
    Persistence Mode                                   : Enabled
    Addressing Mode                                    : N/A
    vGPU Device Capability
        Fractional Multi-vGPU                          : Supported
        Heterogeneous Time-Slice Profiles              : Supported
        Heterogeneous Time-Slice Sizes                 : Supported
        Homogeneous Placements                         : Supported
        MIG Time-Slicing                               : Supported
        MIG Time-Slicing Mode                          : Disabled
    MIG Mode
        Current                                        : Disabled
        Pending                                        : Disabled
    Accounting Mode                                    : Enabled
    Accounting Mode Buffer Size                        : 4000
    Driver Model
        Current                                        : N/A
        Pending                                        : N/A
    Serial Number                                      : 1323825040902
    GPU UUID                                           : GPU-6e3d220b-af14-c6be-d7f2-7678427af18c
    GPU PDI                                            : 0x425458bab645bc57
    Minor Number                                       : 1
    VBIOS Version                                      : 98.02.67.00.0A
    MultiGPU Board                                     : No
    Board ID                                           : 0x2a00
    Board Part Number                                  : 900-2G153-0000-000
    GPU Part Number                                    : 2BB5-895-A1
    FRU Part Number                                    : N/A
    Platform Info
        Chassis Serial Number                          :
        Slot Number                                    : 0
        Tray Index                                     : 0
        Host ID                                        : 1
        Peer Type                                      : Direct Connected
        Module Id                                      : 1
        GPU Fabric GUID                                : 0x0000000000000000
    Inforom Version
        Image Version                                  : G153.0210.00.02
        OEM Object                                     : 2.1
        ECC Object                                     : 7.16
        Power Management Object                        : N/A
    Inforom BBX Object Flush
        Latest Timestamp                               : N/A
        Latest Duration                                : N/A
    GPU Operation Mode
        Current                                        : N/A
        Pending                                        : N/A
    GPU C2C Mode                                       : Disabled
    GPU Virtualization Mode
        Virtualization Mode                            : Host VGPU
        Host VGPU Mode                                 : SR-IOV
        vGPU Heterogeneous Mode                        : Disabled
    GPU Recovery Action                                : None
    GSP Firmware Version                               : 580.126.08
    IBMNPU
        Relaxed Ordering Mode                          : N/A
    PCI
        Bus                                            : 0x2A
        Device                                         : 0x00
        Domain                                         : 0x0000
        Device Id                                      : 0x2BB510DE
        Bus Id                                         : 00000000:2A:00.0
        Sub System Id                                  : 0x204E10DE
        GPU Link Info
            PCIe Generation
                Max                                    : 5
                Current                                : 1
                Device Current                         : 1
                Device Max                             : 5
                Host Max                               : N/A
            Link Width
                Max                                    : 16x
                Current                                : 16x
        Bridge Chip
            Type                                       : N/A
            Firmware                                   : N/A
        Replays Since Reset                            : 0
        Replay Number Rollovers                        : 0
        Tx Throughput                                  : 501 KB/s
        Rx Throughput                                  : 468 KB/s
        Atomic Caps Outbound                           : N/A
        Atomic Caps Inbound                            : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
    Fan Speed                                          : N/A
    Performance State                                  : P8
    Clocks Event Reasons
        Idle                                           : Not Active
        Applications Clocks Setting                    : Not Active
        SW Power Cap                                   : Not Active
        HW Slowdown                                    : Not Active
            HW Thermal Slowdown                        : Not Active
            HW Power Brake Slowdown                    : Not Active
        Sync Boost                                     : Not Active
        SW Thermal Slowdown                            : Not Active
        Display Clock Setting                          : Not Active
    Clocks Event Reasons Counters
        SW Power Capping                               : 1398768 us
        Sync Boost                                     : 0 us
        SW Thermal Slowdown                            : 0 us
        HW Thermal Slowdown                            : 0 us
        HW Power Braking                               : 0 us
    Sparse Operation Mode                              : N/A
    FB Memory Usage
        Total                                          : 97887 MiB
        Reserved                                       : 2288 MiB
        Used                                           : 0 MiB
        Free                                           : 95600 MiB
    BAR1 Memory Usage
        Total                                          : 131072 MiB
        Used                                           : 1 MiB
        Free                                           : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                                          : 0 MiB
        Used                                           : 0 MiB
        Free                                           : 0 MiB
    Compute Mode                                       : Default
    Utilization
        GPU                                            : 0 %
        Memory                                         : 0 %
        Encoder                                        : 0 %
        Decoder                                        : 0 %
        JPEG                                           : 0 %
        OFA                                            : 0 %
    Encoder Stats
        Active Sessions                                : 0
        Average FPS                                    : 0
        Average Latency                                : 0
    FBC Stats
        Active Sessions                                : 0
        Average FPS                                    : 0
        Average Latency                                : 0
    DRAM Encryption Mode
        Current                                        : Disabled
        Pending                                        : Disabled
    ECC Mode
        Current                                        : Disabled
        Pending                                        : Disabled
    ECC Errors
        Volatile
            SRAM Correctable                           : N/A
            SRAM Uncorrectable Parity                  : N/A
            SRAM Uncorrectable SEC-DED                 : N/A
            DRAM Correctable                           : N/A
            DRAM Uncorrectable                         : N/A
        Aggregate
            SRAM Correctable                           : N/A
            SRAM Uncorrectable Parity                  : N/A
            SRAM Uncorrectable SEC-DED                 : N/A
            DRAM Correctable                           : N/A
            DRAM Uncorrectable                         : N/A
            SRAM Threshold Exceeded                    : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                                    : N/A
            SRAM SM                                    : N/A
            SRAM Microcontroller                       : N/A
            SRAM PCIE                                  : N/A
            SRAM Other                                 : N/A
        Channel Repair Pending                         : No
        TPC Repair Pending                             : No
    Retired Pages
        Single Bit ECC                                 : N/A
        Double Bit ECC                                 : N/A
        Pending Page Blacklist                         : N/A
    Remapped Rows
        Correctable Error                              : 0
        Uncorrectable Error                            : 0
        Pending                                        : No
        Remapping Failure Occurred                     : No
        Bank Remap Availability Histogram
            Max                                        : 512 bank(s)
            High                                       : 0 bank(s)
            Partial                                    : 0 bank(s)
            Low                                        : 0 bank(s)
            None                                       : 0 bank(s)
    Temperature
        GPU Current Temp                               : 28 C
        GPU T.Limit Temp                               : 56 C
        GPU Shutdown T.Limit Temp                      : -5 C
        GPU Slowdown T.Limit Temp                      : -2 C
        GPU Max Operating T.Limit Temp                 : 0 C
        GPU Target Temperature                         : N/A
        Memory Current Temp                            : N/A
        Memory Max Operating T.Limit Temp              : N/A
    GPU Power Readings
        Average Power Draw                             : 38.00 W
        Instantaneous Power Draw                       : 38.19 W
        Current Power Limit                            : 600.00 W
        Requested Power Limit                          : 600.00 W
        Default Power Limit                            : 600.00 W
        Min Power Limit                                : 300.00 W
        Max Power Limit                                : 600.00 W
    GPU Memory Power Readings
        Average Power Draw                             : N/A
        Instantaneous Power Draw                       : N/A
    Module Power Readings
        Average Power Draw                             : N/A
        Instantaneous Power Draw                       : N/A
        Current Power Limit                            : N/A
        Requested Power Limit                          : N/A
        Default Power Limit                            : N/A
        Min Power Limit                                : N/A
        Max Power Limit                                : N/A
    Power Smoothing                                    : N/A
    Workload Power Profiles
        Requested Profiles                             : N/A
        Enforced Profiles                              : N/A
    Clocks
        Graphics                                       : 180 MHz
        SM                                             : 180 MHz
        Memory                                         : 405 MHz
        Video                                          : 600 MHz
    Applications Clocks
        Graphics                                       : 2430 MHz
        Memory                                         : 12481 MHz
    Default Applications Clocks
        Graphics                                       : 2430 MHz
        Memory                                         : 12481 MHz
    Deferred Clocks
        Memory                                         : N/A
    Max Clocks
        Graphics                                       : 2430 MHz
        SM                                             : 2430 MHz
        Memory                                         : 12481 MHz
        Video                                          : 2107 MHz
    Max Customer Boost Clocks
        Graphics                                       : 2430 MHz
    Clock Policy
        Auto Boost                                     : N/A
        Auto Boost Default                             : N/A
    Fabric
        State                                          : N/A
        Status                                         : N/A
        CliqueId                                       : N/A
        ClusterUUID                                    : N/A
        Health
            Summary                                    : N/A
            Bandwidth                                  : N/A
            Route Recovery in progress                 : N/A
            Route Unhealthy                            : N/A
            Access Timeout Recovery                    : N/A
            Incorrect Configuration                    : N/A
            Partition Assigned                         : N/A
    Processes                                          : None
    Capabilities
        EGM                                            : disabled

Why are you installing the vGPU host driver? It looks like you don’t use vGPU at all. For GPU passthrough you would need to add 256 GB of MMIO in the VM’s advanced parameters. Did you do so?

@sschaber I do use vGPU (configured 2x 96GiB vGPUs and also the advanced uvm=1 parameters for Unified Memory).

Of course, it would be possible to use PCIe passthrough in this scenario, but the plan is to eventually provide a smaller part of both physical GPUs to different VMs.

So here’s another detail:

The GPU shows up as multiple PCIe devices by default (SR-IOV related I guess?):

On the host UI itself, it shows like this:

The “Needs reboot” labels do not go away after a reboot. Only after enabling SR-IOV in ESXi/vSphere do they show “Enabled” at some point, and this is exactly what I initially did. Could that be a problem? I have reverted this change in the meantime (as seen in the screenshots), but so far I don’t know if it makes a difference. I can’t deterministically reproduce the “crash”.

BIOS/EFI has all the requirements configured (above 4G decoding, SR-IOV, ReBar, etc.).

I cannot follow. I don’t see a 96Q or 96C profile in nvidia-smi, so how do you think you are using vGPU?

Could you send a screenshot from nvidia-smi to see the type of GPU mentioned there?

@sschaber Apologies for not being precise. The nvidia-smi output is from the host (ESXi).

The VM uses 2x 96Q profiles :-)

OK :)

So could you test setting the MMIO size to 512 GB on the VM? It may require a large MMIO reservation with 2x RTX Pro 6000.

pciPassthru.64bitMMIOSizeGB = "512"

@sschaber Thank you! I’ll try that and monitor the situation for a while and report back here if it worked.

@sschaber It seems that either your suggestion to increase the MMIO size or disabling SR-IOV (in the ESXi configuration) helped. I have not experienced this issue again since making these changes. Thank you!

Just wondering if this is missing from the docs? The vGPU docs (latest version) only mention a handful of cards that need more than 64 GB of MMIO, and the Enterprise AI docs do not mention the RTX Pro 6000 Blackwell SE here ( NVIDIA vGPU for Compute - NVIDIA AI Enterprise ) either. The RTX 6000 Ada and similar cards, on the other hand, are included in the list, which makes me think that the Blackwell card should be in there as well.

RTX Pro is still missing from the AI Enterprise docs. I have seen this as well, but the procedure is the same: BAR1 is 128 GB, so we need 2 x BAR1 x 2 GPUs = 512 GB. I will arrange the docs change.
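
For anyone sizing this for a different card count, here is a minimal Python sketch of the rule of thumb from this thread (2 x BAR1 x number of GPUs). The function names are hypothetical, and rounding up to a power of two is an assumption based on common VMware guidance, not something stated in this thread. The `pciPassthru.use64bitMMIO` line is likewise not mentioned above; it is the companion VMX setting usually paired with the size parameter, so check it against your own VM configuration.

```python
import math

def mmio_size_gb(bar1_mib_per_gpu, gpu_count, factor=2):
    """Estimate a 64-bit MMIO size in GB using factor x BAR1 x GPU count.

    The result is rounded up to the next power of two (an assumption,
    see the note above).
    """
    needed_gb = math.ceil(factor * bar1_mib_per_gpu * gpu_count / 1024)
    size = 1
    while size < needed_gb:
        size *= 2
    return size

def vmx_lines(size_gb):
    """Return the VM advanced parameters as VMX-style lines."""
    return [
        'pciPassthru.use64bitMMIO = "TRUE"',  # assumed companion setting
        f'pciPassthru.64bitMMIOSizeGB = "{size_gb}"',
    ]

# BAR1 Total from the nvidia-smi -q output above is 131072 MiB per GPU.
size = mmio_size_gb(131072, 2)
print("\n".join(vmx_lines(size)))
```

With the BAR1 size reported above (131072 MiB) and two GPUs, this yields the same 512 GB value suggested earlier.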
