Currently, I am trying to set up vGPU on Ubuntu following the Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation.
GPU model : “NVIDIA RTX A6000”
OS: “Ubuntu 20.04.3 LTS (Focal Fossa)”
Kernel: “5.11.0-41-generic”
Motherboard model: X12SCA-F
$ lspci -nn | grep -i nvid
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2230] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
I am able to get output from the nvidia-smi vgpu command:
$ nvidia-smi vgpu
Tue Dec 21 16:11:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82 Driver Version: 470.82 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA RTX A6000 | 00000000:01:00.0 | 0% |
+---------------------------------+------------------------------+------------+
$ nvidia-smi -q
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
$ lsmod | grep -i nvidia
nvidia_vgpu_vfio 57344 0
nvidia 35319808 11
mdev 28672 2 vfio_mdev,nvidia_vgpu_vfio
drm 548864 12 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,i915,ttm
But when I try to create vGPUs by running /usr/lib/nvidia/sriov-manage -e ALL, no mdev devices get created.
I am also unable to see an mdev_supported_types directory under the /sys/bus/pci/devices/0000:01:00.0 path.
$ mdevctl start -u 30820a6f-b1a5-4503-91ca-0c10ba58692a -p 0000:01:00.0 --type nvidia-63
Parent 0000:01:00.0 is not currently registered for mdev support
How do I get mdev_supported_types to appear? Please let me know if any modules are missing.
Please post the full output from nvidia-smi -q.
I assume you are running the A6000 in the wrong mode. Keep in mind that this is a workstation GPU and needs to be switched into datacenter (DC) mode with the Display Mode Selector tool to make it work with vGPU!
Hope this helps.
==============NVSMI LOG==============
Timestamp : Tue Dec 21 16:11:41 2021
Driver Version : 470.82
CUDA Version : Not Found
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA RTX A6000
Product Brand : NVIDIA RTX
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1320521008011
GPU UUID : GPU-36067b8b-2c19-2693-482a-5da8b89fa917
Minor Number : 0
VBIOS Version : 94.02.5C.00.02
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : 900-5G133-2200-000
Module ID : 0
Inforom Version
Image Version : G133.0500.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x223010DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x145910DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 30 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 48685 MiB
Used : 0 MiB
Free : 48685 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 1 MiB
Free : 255 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 35 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 31.37 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Default Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 8001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 750.000 mV
Processes : None
@sschaber,
Do I need to follow the gpumodeswitch User Guide :: NVIDIA Virtual GPU Software Documentation for the NVIDIA RTX A6000? I ask because that page lists the supported products as “Tesla M60” and “Tesla M6”.
Hi, not gpumodeswitch but the Display Mode Selector tool. Totally different story.
As expected, your GPU is running in the wrong mode. You need the big BAR1 size (64 GB) instead of 256 MB:
BAR1 Memory Usage
Total : **256 MiB**
Used : 1 MiB
Free : 255 MiB
You need to download the mode selector tool here: https://developer.nvidia.com/displaymodeselector
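As a rough sketch of how to run it once downloaded (the file name and location here are assumptions; adjust for wherever you extracted it):
$ chmod +x displaymodeselector
$ sudo ./displaymodeselector --listgpumodes   # show the current/available modes first
$ sudo ./displaymodeselector --gpumode        # then pick the new mode interactively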
@sschaber,
I want to create “C-series Compute-Intensive” vGPUs; for that, I need to set “Physical Display Ports Disabled”.
Currently, I see the following options:
./displaymodeselector --gpumode
NVIDIA Display Mode Selector Utility (Version 1.48.0)
Copyright (C) 2015-2020, NVIDIA Corporation. All Rights Reserved.
WARNING: This operation updates the firmware on the board and could make
the device unusable if your host system lacks the necessary support.
Are you sure you want to continue?
Press 'y' to confirm (any other key to abort):
y
Select a number:
<0> physical_display_enabled_256MB_bar1
<1> physical_display_disabled
<2> physical_display_enabled_8GB_bar1
Also, I don’t see a “(64GB)” option.
physical_display_disabled is the right option. Check nvidia-smi -q afterwards and you will see the BIG BAR1 size :)
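A quick sketch of the check after the reboot (the grep is just one way to pull out the relevant block):
$ nvidia-smi -q | grep -A 3 "BAR1 Memory Usage"
The Total line should then report the large BAR1 size instead of 256 MiB.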
@sschaber I have uninstalled the vGPU manager, executed ./displaymodeselector --gpumode compute, and received the message below.
NOTE:
Preserving straps from original image.
Executing automatic disable of EEPROM write protect...
Remove EEPROM write protect complete.
Storing updated firmware image...
[==================================================] 100 %
Verifying update...
Update successful.
Firmware image updated.
- New version: 94.02.5C.00.02
- Old version: 94.02.5C.00.02
InfoROM image updated.
- New version: G133.0500.00.05
- Old version: G133.0500.00.05
Setting EEPROM software protect setting...
Setting EEPROM protection complete.
Successfully updated GPU mode to "physical_display_disabled" ( Mode 4 ).
A reboot is required for the update to take effect.
Currently, the server has been stuck in PCI Bus Enumeration for quite a long time and is not booting up ;( Is it normal for this to take so long?
Not sure if the message above is related to the mode selector. What hardware (OEM) are you using? I assume/hope you have an onboard GPU serving as the primary GPU; otherwise you won’t be able to boot anymore. You always need a second GPU (onboard GPU) to serve as the primary display.
@sschaber
Yes, I am using a Supermicro X12SCA-F motherboard: Supermicro X12SCA-F Motherboard ATX Single Socket LGA-1200 (Socket H5) for Intel Xeon W-1200 Processors | Wiredzone.
Also, I am using IPMI to log in to the system, and no monitors are connected to the actual HDMI/VGA ports.
OK, sounds good. SR-IOV also needs to be enabled in BIOS to support the Ampere GPU in datacenter mode.
Are you still stuck or can you boot properly?
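If you do get it to boot, a rough way to sanity-check SR-IOV from the OS (just a sketch; the address is taken from your earlier lspci output):
$ sudo lspci -s 01:00.0 -vvv | grep -i "Single Root I/O Virtualization"
The GPU should advertise an SR-IOV capability there; the virtual functions themselves only appear after sriov-manage enables them.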
@sschaber Yes, SR-IOV is already enabled in the BIOS, and it is still stuck on the same DXE--PCI Bus Enumeration screen. Not even the boot selection, etc., is getting displayed.
I have already tried power off/power on from IPMI, but no luck.
@sschaber
It looks like EFI was selected for the PCIe/PCI BIOS setting, so I changed it to Legacy. The system now boots successfully, and I was able to create the mdev devices following the Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation (rough outline of the commands below).
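For anyone following along, roughly the sequence that worked for me (the VF address and vGPU type are from my setup, and the UUID is freshly generated, so treat these as placeholders):
$ sudo /usr/lib/nvidia/sriov-manage -e ALL
$ ls /sys/class/mdev_bus/0000:01:00.4/mdev_supported_types
# pick the vGPU type for the profile you want (nvidia-63 in my case), then create an mdev under it
$ UUID=$(uuidgen)
$ echo "$UUID" | sudo tee /sys/class/mdev_bus/0000:01:00.4/mdev_supported_types/nvidia-63/create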
Nice to hear it’s finally working!
@sschaber
After attaching the vGPU through the virsh edit command, the VM displays the following while booting: Failed to set iommu for container: Invalid argument nvidia
Do we need to add any specific GRUB option or module in the VM?
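For context, this is the kind of change I was wondering about, i.e. enabling the IOMMU on the kernel command line. I am showing it for the hypervisor host; whether anything similar is needed inside the VM is exactly my question, so treat this as an assumption rather than a confirmed fix:
$ sudo vi /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt"
$ sudo update-grub
$ sudo reboot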
Below are the details from the hypervisor host with the vGPU manager installed.
/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# lspci -s 01:00.0 -k
01:00.0 3D controller: NVIDIA Corporation Device 2230 (rev a1)
Subsystem: NVIDIA Corporation Device 1459
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
Also, can you please share a reference for any specific BIOS settings for the hypervisor host?
Hi, unfortunately we cannot provide any details on the BIOS settings necessary, as this depends on the OEM. They need to provide the required settings. But you could check this (maybe also relevant in your case):
https://enterprise-support.nvidia.com/s/article/PCIe-AER-Advanced-Error-Reporting-and-ACS-Access-Control-Services-BIOS-Settings-for-vGPUs-that-Support-SR-IOV
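As a rough way to see how ACS is configured on the bridge above the GPU (just a sketch, not an official procedure; 00:01.0 is only an example root-port address):
$ lspci -t                                               # find the root port above 01:00.0
$ sudo lspci -s 00:01.0 -vvv | grep -iA2 "Access Control Services"
The ACSCtl line there shows which ACS features are currently enabled on that port.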
@sschaber,
While executing /usr/lib/nvidia/sriov-manage -e ALL, I found the following in syslog. Any input on how to resolve this, or can it be ignored?
kernel: [ 5075.335165] NVRM: Aborting probe for VF 0000:01:01.4 since PF is not bound to nvidia driver.
kernel: [ 5075.335167] nvidia: probe of 0000:01:01.4 failed with error -1
kernel: [ 5075.335172] pci-pf-stub 0000:01:01.4: claimed by pci-pf-stub
kernel: [ 5075.335232] pci 0000:01:01.5: [10de:2230] type 00 class 0x030200
libvirtd[1187]: libvirt version: 6.0.0, package: 0ubuntu8.15 (Christian Ehrhardt <christian.ehrhardt@canonical.com> Thu, 18 Nov 2021 10:23:11 +0100)
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.0'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.1'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.0'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.1'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.2'
libvirtd[1187]: internal error: Unknown PCI header type '127' for device '0000:01:01.2'
@sschaber,
For the A6000 GPU-related issue, I have reached out to the OEM for BIOS settings and am waiting for a reply.
Based on the Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation, I would like to confirm that, as with the A6000, we need to change the GPU mode on the A5000 using the displaymodeselector tool, right?
I tried using the displaymodeselector tool but ended up with an error message saying Specified GPU mode not supported on this device 0x2231.
Requesting your guidance.
You’re right. The A5000 also needs to be changed into DC mode with the mode selector tool. These are the two workstation GPUs that can be used for vGPU after the mode change.
@sschaber
Thanks for the confirmation.
As mentioned in my message, it is not working out:
./displaymodeselector --listgpumodes
NVIDIA Display Mode Selector Utility (Version 1.48.0)
Copyright (C) 2015-2020, NVIDIA Corporation. All Rights Reserved.
Adapter: Graphics Device (10DE,xxxx,10DE,xxxx) S:00,B:01,D:00,F:00
EEPROM ID (EF,6015) : WBond W25Q16FW/JW 1.65-1.95V 16384Kx1S, page
GPU Mode: N/A
But on this A5000 GPU server, the host OS is CentOS Stream 8. Does that matter, or do you suspect something else?