Issue with vGPU setup on Ubuntu 20.04.3

Currently, I am trying to set up vGPU on Ubuntu, following the Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation.

GPU model : “NVIDIA RTX A6000”
OS: “Ubuntu 20.04.3 LTS (Focal Fossa)”
Kernel: “5.11.0-41-generic”
Motherboard model: X12SCA-F

$ lspci -nn | grep -i nvid
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2230] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)

I am able to get output from the nvidia-smi vgpu command:

$ nvidia-smi vgpu
Tue Dec 21 16:11:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82                 Driver Version: 470.82                    |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A6000           | 00000000:01:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
$ nvidia-smi -q
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
$ lsmod | grep -i nvidia
nvidia_vgpu_vfio       57344  0
nvidia              35319808  11
mdev                   28672  2 vfio_mdev,nvidia_vgpu_vfio
drm                   548864  12 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,i915,ttm

But when I try to create vGPUs through /usr/lib/nvidia/sriov-manage -e ALL, no mdev devices are created.
I am also unable to see an mdev_supported_types directory under the /sys/bus/pci/devices/0000:01:00.0 path.

$ mdevctl start -u 30820a6f-b1a5-4503-91ca-0c10ba58692a -p 0000:01:00.0 --type nvidia-63
Parent 0000:01:00.0 is not currently registered for mdev support

How do I get mdev_supported_types? Please let me know if any modules are missing.
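For reference, this is the quick check I am using to see which PCI functions expose mdev support (a sketch; the sysfs base path is passed as a parameter so the helper can be exercised against any directory — on the real host it would be /sys/bus/pci/devices or /sys/class/mdev_bus):

```shell
#!/bin/sh
# Sketch: list PCI functions under a sysfs-style tree that expose an
# mdev_supported_types directory. The base directory is a parameter
# here so the helper can be tried against any tree, not just sysfs.
list_mdev_capable() {
    base="$1"
    for dev in "$base"/*; do
        [ -d "$dev/mdev_supported_types" ] && echo "${dev##*/}"
    done
    return 0
}
```

On my host this prints nothing, matching the missing mdev_supported_types directory described above.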

Please post the full output from nvidia-smi -q.
I assume you are running the A6000 in the wrong mode. Keep in mind that this is a workstation GPU and needs to be switched into DC mode with the mode selector tool to make it work with vGPU!

Hope this helps!

==============NVSMI LOG==============

Timestamp                                 : Tue Dec 21 16:11:41 2021
Driver Version                            : 470.82
CUDA Version                              : Not Found

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA RTX A6000
    Product Brand                         : NVIDIA RTX
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1320521008011
    GPU UUID                              : GPU-36067b8b-2c19-2693-482a-5da8b89fa917
    Minor Number                          : 0
    VBIOS Version                         : 94.02.5C.00.02
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : 900-5G133-2200-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G133.0500.00.05
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x223010DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x145910DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 48685 MiB
        Used                              : 0 MiB
        Free                              : 48685 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 35 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 31.37 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1800 MHz
        Memory                            : 8001 MHz
    Default Applications Clocks
        Graphics                          : 1800 MHz
        Memory                            : 8001 MHz
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 8001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 750.000 mV
    Processes                             : None

@sschaber ,
Do I need to follow the gpumodeswitch User Guide :: NVIDIA Virtual GPU Software Documentation for the NVIDIA RTX A6000? On that page, the only supported products I came across were the Tesla M60 and Tesla M6.

Hi, not gpumodeswitch but the mode selector tool. Totally different story.
As expected, your GPU is running in the wrong mode. You need the big BAR1 size (64 GB) instead of 256 MB:

BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB


You need to download the mode selector tool here: https://developer.nvidia.com/displaymodeselector

@sschaber ,

I want to create C-series compute-intensive vGPUs; for that, I need to set "Physical Display Ports Disabled".

Currently, I see the following options:

 ./displaymodeselector --gpumode

NVIDIA Display Mode Selector Utility (Version 1.48.0)
Copyright (C) 2015-2020, NVIDIA Corporation. All Rights Reserved.


WARNING: This operation updates the firmware on the board and could make
         the device unusable if your host system lacks the necessary support.

Are you sure you want to continue?
Press 'y' to confirm (any other key to abort):
y
Select a number:
<0> physical_display_enabled_256MB_bar1
<1> physical_display_disabled
<2> physical_display_enabled_8GB_bar1

Also, I don't see that 64 GB option.

physical_display_disabled is the right option. Check nvidia-smi -q afterwards and you will see the BIG BAR1 size :)
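A small sketch for pulling just the BAR1 total out of the nvidia-smi -q output after the reboot (the awk program relies only on the "BAR1 Memory Usage" block layout shown earlier in this thread; pipe real output into it with nvidia-smi -q | bar1_total):

```shell
#!/bin/sh
# Sketch: extract the BAR1 total from `nvidia-smi -q` output, e.g.
#   nvidia-smi -q | bar1_total
# Assumes the "BAR1 Memory Usage" block layout shown in this thread.
bar1_total() {
    awk '/BAR1 Memory Usage/ { in_bar1 = 1; next }
         in_bar1 && /Total/  { print $(NF-1), $NF; exit }'
}
```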


@sschaber I have uninstalled the vGPU manager, executed ./displaymodeselector --gpumode compute, and received the message below.

NOTE:
Preserving straps from original image.
Executing automatic disable of EEPROM write protect...
Remove EEPROM write protect complete.
Storing updated firmware image...
[==================================================] 100 %
Verifying update...
Update successful.

Firmware image updated.
 - New version: 94.02.5C.00.02
 - Old version: 94.02.5C.00.02

InfoROM image updated.
 - New version: G133.0500.00.05
 - Old version: G133.0500.00.05

Setting EEPROM software protect setting...
Setting EEPROM protection complete.
Successfully updated GPU mode to "physical_display_disabled" ( Mode 4 ).
A reboot is required for the update to take effect.

Currently, the server has been stuck in PCI Bus Enumeration for quite a long time and is not booting up ;( Is it expected to take this long?

Not sure if the message above is related to the mode selector. What hardware (OEM) are you using? I assume/hope you have an onboard GPU serving as the primary GPU; otherwise you won't be able to boot anymore. You always need a second GPU (onboard GPU) to serve as the primary display.

@sschaber
Yes, I am using a Supermicro X12SCA-F motherboard. Supermicro X12SCA-F Motherboard ATX Single Socket LGA-1200 (Socket H5) for Intel Xeon W-1200 Processors | Wiredzone.

Also, I am using IPMI to log in to the system; no monitors are connected to the physical HDMI/VGA ports.

OK, sounds good. SR-IOV also needs to be enabled in the BIOS to support the Ampere GPU in datacenter mode.
Are you still stuck or can you boot properly?

@sschaber Yes, SR-IOV is already enabled in the BIOS, and it is still stuck on the same DXE--PCI Bus Enumeration screen. Not even the boot selection menu is displayed.

I have already tried a power off/power on from IPMI, but no luck.

@sschaber
It looks like EFI had been selected for the PCIe/PCI BIOS setting; I changed it to Legacy. Now the system boots successfully, and I was able to create the mdev devices following the Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation.
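For reference, the sysfs create step from the user guide looks roughly like this in my setup (a sketch; the VF path and type name below are examples from this thread, not verified values, and the path base is a parameter so the helper can be tried against any directory):

```shell
#!/bin/sh
# Sketch of the sysfs mdev-creation step from the vGPU user guide:
# writing a UUID to the type's "create" node creates the mdev device.
# The VF path and type name are examples, not verified values.
create_mdev() {
    vf_path="$1"    # e.g. /sys/class/mdev_bus/0000:01:00.4
    mdev_type="$2"  # e.g. nvidia-63
    uuid="$3"       # e.g. output of uuidgen
    echo "$uuid" > "$vf_path/mdev_supported_types/$mdev_type/create"
}
```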

Nice to hear it’s finally working!

@sschaber
After attaching the vGPU through the virsh edit command, the VM displays Failed to set iommu for container: Invalid argument nvidia while booting.

Do we need to add any specific GRUB option or module in the VM?
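For completeness, one host-side setting I plan to double-check for this vfio error is whether the IOMMU is enabled on the hypervisor's kernel command line (a sketch; Intel platform assumed, and not confirmed as the fix for this particular error):

```shell
# /etc/default/grub on the hypervisor host (sketch; Intel platform
# assumed, not confirmed as the fix for the error above):
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt"
# Apply and reboot:
#   sudo update-grub && sudo reboot
# Verify after reboot:
#   dmesg | grep -e DMAR -e IOMMU
```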

Below are the details from the hypervisor host with the vGPU manager installed.

/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# lspci -s 01:00.0 -k
01:00.0 3D controller: NVIDIA Corporation Device 2230 (rev a1)
	Subsystem: NVIDIA Corporation Device 1459
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia

Also, could you please share a reference for any specific BIOS settings for the hypervisor host?

Hi, unfortunately we cannot provide any details on the necessary BIOS settings, as this depends on the OEM; they need to provide the required settings. But you could check this (maybe also relevant in your case):
https://enterprise-support.nvidia.com/s/article/PCIe-AER-Advanced-Error-Reporting-and-ACS-Access-Control-Services-BIOS-Settings-for-vGPUs-that-Support-SR-IOV

@sschaber ,

When I execute /usr/lib/nvidia/sriov-manage -e ALL, I find the messages below in the syslog. Any input on how to resolve this, or can it be ignored?

kernel: [ 5075.335165] NVRM: Aborting probe for VF 0000:01:01.4 since PF is not bound to nvidia driver.
kernel: [ 5075.335167] nvidia: probe of 0000:01:01.4 failed with error -1
kernel: [ 5075.335172] pci-pf-stub 0000:01:01.4: claimed by pci-pf-stub
kernel: [ 5075.335232] pci 0000:01:01.5: [10de:2230] type 00 class 0x030200
libvirtd[1187]: libvirt version: 6.0.0, package: 0ubuntu8.15 (Christian Ehrhardt <christian.ehrhardt@canonical.com> Thu, 18 Nov 2021 10:23:11 +0100)
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.0'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.1'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.0'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.1'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.2'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.2'

@sschaber ,
For the A6000 GPU issue, I have reached out to the OEM for the BIOS settings and am waiting for a reply.
Based on the Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation, I would like to confirm that, as with the A6000, the A5000 GPU's mode needs to be changed using the displaymodeselector tool, right?
I tried using the displaymodeselector tool but ended up with an error message saying Specified GPU mode not supported on this device 0x2231.
Requesting your guidance.

You're right. The A5000 also needs to be changed into DC mode with the mode selector tool. These are the two workstation GPUs that can be used for vGPU after the mode change.

@sschaber
Thanks for the confirmation.
As mentioned in my message, it's not working out:

./displaymodeselector --listgpumodes
NVIDIA Display Mode Selector Utility (Version 1.48.0)
Copyright (C) 2015-2020, NVIDIA Corporation. All Rights Reserved.
Adapter: Graphics Device      (10DE,xxxx,10DE,xxxx) S:00,B:01,D:00,F:00

EEPROM ID (EF,6015) : WBond W25Q16FW/JW 1.65-1.95V 16384Kx1S, page
GPU Mode: N/A

But on this A5000 GPU server, the host OS is CentOS Stream 8. Does that matter, or do you suspect something else?