I can't boot a VM with vGPU

Host : vSphere 8.0a-20842819/Tesla P4
Guest : Windows 10 x64 (1809 and 21H2 tested)
vGPU packages : NVIDIA-GRID-vSphere-8.0-525.85.07-525.85.05-528.24 and NVIDIA-GRID-vSphere-8.0-525.60.12-525.60.13-527.41 (both tested, same result)

Everything works well without vGPU (even when I pass through a Quadro), but it does not work with GRID vGPU.

When I add a GRID vGPU to the VM it seems OK, but as soon as I install the graphics driver in the guest OS it stops working and goes into an endless BSOD loop (Video TDR Failure).

Steps tried: test the latest version → remove the driver and MGMT from the host and reboot → test the older version

nvidia-smi -q
==============NVSMI LOG==============
Timestamp                                 : Thu Feb  9 14:09:29 2023
Driver Version                            : 525.60.12
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : Tesla P4
    Product Brand                         : Tesla
    Product Architecture                  : Pascal
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0325017068044
    GPU UUID                              : GPU-089680be-cd9e-dd3e-d0d8-3dd173e36851
    Minor Number                          : 0
    VBIOS Version                         : 86.04.55.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : 900-2G414-0000-000
    GPU Part Number                       : 1BB3-895-A1
    Module ID                             : 0
    Inforom Version
        Image Version                     : G414.0200.00.03
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : Non SR-IOV
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1BB310DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x11D810DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 8192 MiB
        Reserved                          : 0 MiB
        Used                              : 28 MiB
        Free                              : 8163 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 47 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 11.60 W
        Power Limit                       : 75.00 W
        Default Power Limit               : 75.00 W
        Enforced Power Limit              : 75.00 W
        Min Power Limit                   : 60.00 W
        Max Power Limit                   : 75.00 W
    Clocks
        Graphics                          : 455 MHz
        SM                                : 455 MHz
        Memory                            : 405 MHz
        Video                             : 455 MHz
    Applications Clocks
        Graphics                          : 885 MHz
        Memory                            : 3003 MHz
    Default Applications Clocks
        Graphics                          : 885 MHz
        Memory                            : 3003 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1531 MHz
        SM                                : 1531 MHz
        Memory                            : 3003 MHz
        Video                             : 1379 MHz
    Max Customer Boost Clocks
        Graphics                          : 1113 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None
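
For anyone comparing hosts, a few fields in the log above are easier to check when pulled out by script. The following is only a minimal Python sketch: `parse_nvidia_smi_q` and `SAMPLE` are hypothetical names made up for this example, not part of any NVIDIA tool, and the parser assumes the space-indented `key : value` layout shown in this post. It extracts values and does not judge whether they are correct for a given setup.

```python
# SAMPLE is a short excerpt copied from the `nvidia-smi -q` output above.
SAMPLE = """\
Driver Version                            : 525.60.12
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported
Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : Tesla P4
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : Non SR-IOV
    PCI
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
"""


def parse_nvidia_smi_q(text):
    """Flatten indented `key : value` lines into {"Section/Sub/Key": "value"}."""
    flat = {}
    stack = []  # (indent, name) of the section headers enclosing the current line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("="):
            continue  # skip blank lines and the "====NVSMI LOG====" banner
        indent = len(raw) - len(raw.lstrip())
        # Drop headers that sit at the same or a deeper indent than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        key, sep, value = line.partition(" : ")
        if sep:
            path = [name for _, name in stack] + [key.strip()]
            flat["/".join(path)] = value.strip()
        else:
            # A line without " : " is treated as a section header.
            stack.append((indent, line))
    return flat


info = parse_nvidia_smi_q(SAMPLE)
gpu = "GPU 00000000:01:00.0"
print("Driver:", info["Driver Version"])
print("Virtualization mode:", info[f"{gpu}/GPU Virtualization Mode/Virtualization Mode"])
print("PCIe generation (current/max):",
      info[f"{gpu}/PCI/GPU Link Info/PCIe Generation/Current"], "/",
      info[f"{gpu}/PCI/GPU Link Info/PCIe Generation/Max"])
```

To run it on a real host, the saved text of `nvidia-smi -q` could be passed in place of `SAMPLE`.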

Current Status

Hi,

could you please share a few more details? Which hardware are you using? Is the hardware on the certified servers list for vGPU?

I’m asking as the P4 is a pretty old GPU and I haven’t seen it for a while in new projects.

Best regards
Simon

It is a test server (not certified hardware).

It uses Comet Lake (i5-10600, 128 GB RAM, an old Intel AIC SSD).

The P4 is also only for testing (if it works well I will use Turing or Ampere).

I want to run a small VDI service (1 to 20 clients, over RDP) for internal use (server + thin clients).

I want to check performance, latency, power usage, etc. before buying hardware for the service.

I will run 8 VMs for the test (4 VMs with vGPU and 4 VMs without vGPU) and compare the two groups.

Do I need certified hardware to use vGPU, or not? (Does it only work with a certified board?)

It might work on “not certified hardware”, but it might be pretty hard to give you proper advice in case of issues with unsupported hardware, as we always have dependencies on the hardware.