Hello, please help me.
I'm running RHEL 8.6 with an A5000 GPU.
nvidia-smi
lsmod | grep vfio
But under /sys/class I can't find the mdev_bus directory.
Please help me and God bless you!
Do you have SR-IOV enabled?
You need to run /usr/lib/nvidia/sriov-manage -e ALL
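After running it you should see new virtual functions on the GPU, and the mdev_bus directory should appear. A quick way to check (the PCI address below is only an example; use the one lspci reports for your A5000):
lspci -d 10de: -nn
sudo /usr/lib/nvidia/sriov-manage -e ALL
ls -l /sys/bus/pci/devices/0000:3b:00.0/virtfn*
ls /sys/class/mdev_bus/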
I have the same problem on RHEL 9.
SR-IOV is enabled and I ran /usr/lib/nvidia/sriov-manage -e ALL.
I don't know if this is the reason, but I've compared RHEL 9 to AlmaLinux 8.6 (where I ran the previous vGPU version) and found the following:
RHEL 9: lsmod | grep vfio
nvidia_vgpu_vfio 65536 0
mdev 32768 1 nvidia_vgpu_vfio
vfio_iommu_type1 45056 0
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
Alma 8.6: lsmod | grep vfio
nvidia_vgpu_vfio 27099 0
nvidia 12316924 1 nvidia_vgpu_vfio
vfio_mdev 12841 0
mdev 20414 2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1 22342 0
vfio 32331 3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
If you compare the two, vfio_mdev is missing in RHEL 9.
We are talking about different versions of the OS (I haven't run Alma 9)…
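For what it's worth, a quick way to check whether vfio_mdev even ships with the RHEL 9 kernel (my guess, not confirmed, is that its functionality was folded into the core mdev module on newer kernels, so the missing line may be harmless):
modinfo vfio_mdev
find /lib/modules/$(uname -r) -name 'vfio*' -o -name 'mdev*'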
Any ideas?
Thanks.
I'm having the same issue on Oracle Linux 9, kernel 5.14.0-70.30.1.0.1.el9_0.x86_64, with an A5000.
SR-IOV is enabled in the BIOS and I ran /usr/lib/nvidia/sriov-manage -e ALL
lsmod | grep vfio
nvidia_vgpu_vfio 65536 0
mdev 32768 1 nvidia_vgpu_vfio
vfio_iommu_type1 45056 0
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
Any ideas?
I found the solution: the issue is that the A5000 has its physical display ports enabled.
“Some supported NVIDIA GPUs don’t have vGPU enabled out of the box and need to have their display ports disabled. This is the case with our RTX A5000, and can be achieved by using their display mode selector tool”
./displaymodeselector --gpumode
It's an interactive prompt: you basically select "physical_display_disabled" and then choose which GPUs to apply it to. After that, running /usr/lib/nvidia/sriov-manage -e ALL should produce some output; then reboot. I created a crontab entry so this is executed on every reboot, like so:
@reboot root /usr/lib/nvidia/sriov-manage -e ALL
But you can also create a systemd unit to do this.
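A minimal sketch of such a unit, assuming the stock /usr/lib/nvidia/sriov-manage path and a made-up unit name (if your vGPU host driver installs an nvidia-vgpu-mgr.service, ordering after it is probably a good idea):
# /etc/systemd/system/nvidia-sriov.service  (file name is just an example)
[Unit]
Description=Enable SR-IOV virtual functions for NVIDIA vGPU
After=nvidia-vgpu-mgr.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
Then run systemctl daemon-reload and systemctl enable --now nvidia-sriov.service.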
Here are a bit more details regarding this issue.
https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x
Greetings :)
Hi,
just to add, this is also documented here: https://forums.developer.nvidia.com/uploads/short-url/wgqrloFXITvrWtGMI0QAMQVaWyD.pdf
These are workstation GPUs and therefore not enabled by default for virtualization.
regards
Simon
I was able to program the card for virtualization. Now I have a weird issue: I create 4 mdevs of 6 GB each and get some strange behavior.
The devices shown in the screenshot as (rev a1) work, but the ones showing (rev ff) don't.
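(For anyone following along: the mdevs on the virtual functions are created through sysfs as root; a rough sketch, where 0000:41:00.4 and nvidia-xxx are placeholders for an actual VF address and one of the profile types listed under mdev_supported_types:)
ls /sys/class/mdev_bus/0000:41:00.4/mdev_supported_types/
cat /sys/class/mdev_bus/0000:41:00.4/mdev_supported_types/nvidia-xxx/name
echo "$(uuidgen)" > /sys/class/mdev_bus/0000:41:00.4/mdev_supported_types/nvidia-xxx/create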
On the (rev a1) devices, lspci looks good:
lspci -v -s 41:00.0
41:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
Subsystem: NVIDIA Corporation Device 147e
Flags: bus master, fast devsel, latency 0, IRQ 268, IOMMU group 45
Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
Memory at 26800000000 (64-bit, prefetchable) [size=32G]
Memory at 27c30000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] Null
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia
But on the (rev ff) devices I get the following error:
lspci -vv -s 41:01.0
41:01.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia
I did a bit of research and apparently it could be a power or thermal issue, but the card is idle, not doing much, plugged into a chassis in a data center.
Hi folks,
I am having the same issue on “NVIDIA A100 80GB PCIe”. The mdev_bus directory hasn’t been created.
SR-IOV and IOMMU are enabled as well.
sudo /usr/lib/nvidia/sriov-manage -e ALL
Enabling VFs on 0000:01:00.0
Cannot obtain unbindLock for 0000:01:00.0
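My guess is that the unbind lock fails because another process is still holding the GPU (persistence mode is enabled, per the nvidia-smi output below), so something like this should show what is using it and let me retry; this is only a guess:
sudo fuser -v /dev/nvidia*
sudo systemctl stop nvidia-persistenced
sudo /usr/lib/nvidia/sriov-manage -e ALL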
lsmod | grep vfio
nvidia_vgpu_vfio 53248 0
vfio_mdev 16384 0
mdev 24576 2 vfio_mdev,nvidia_vgpu_vfio
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Mon Jan 9 17:57:21 2023
Driver Version : 510.108.03
CUDA Version : Not Found
Attached GPUs : 4
GPU 00000000:01:00.0
Product Name : NVIDIA A100 80GB PCIe
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Enabled
Pending : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Could the problem be due to Display Mode being enabled?
If yes, how can I disable it? The A100 PCIe is not in the list of GPUs supported by displaymodeselector.
Thank you
Hi everyone!
If you have a problem with (rev ff), you also need to enable ACS and ARI in your BIOS. In my case, on an ASUS PRIME X570-P with BIOS 4408:
Advanced > AMD CBS > NBIO Common Options > ACS Enable
Advanced > AMD CBS > NBIO Common Options > PCIe ARI Support
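After enabling them, you can double-check from the OS that the firmware actually exposes these capabilities; something like this should list any ACS and ARI capabilities lspci sees:
sudo lspci -vvv | grep -E 'Access Control Services|Alternative Routing-ID'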
I am having the same problem. Is there any solution?
I had the same problem. After enabling SR-IOV, it works.
But does the A100 have to have SR-IOV enabled to use GRID vGPU?
Hi,
I also have the same problem on an “NVIDIA A100 80GB PCIe”.
I am having trouble using vGPU on Ubuntu KVM.
Is there any solution?
I am having the same problem with an A16. I followed the Virtual GPU Software Documentation for installing on Linux KVM, and when I run sriov-manage -e ALL I do not see mdev_bus.
Running on Ubuntu 22.04.3 LTS
SR-IOV enabled in BIOS
Motherboard is X11DPG-OT-CPU
Result for one GPU when running nvidia-smi -q
root@p02r99srv10:~# nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Thu Sep 28 09:03:41 2023
Driver Version : 535.104.06
CUDA Version : 12.2
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported
Attached GPUs : 4
GPU 00000000:41:00.0
Product Name : NVIDIA A16
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Minor Number : 0
VBIOS Version : 94.07.54.00.45
MultiGPU Board : Yes
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G171.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x41
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:41:00.0
Sub System Id : 0x14A910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15356 MiB
Reserved : 258 MiB
Used : 0 MiB
Free : 15097 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1755 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1755 MHz
SM : 1755 MHz
Memory : 6251 MHz
Video : 1635 MHz
Max Customer Boost Clocks
Graphics : 1755 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Fabric
State : N/A
Status : N/A
Processes : None
I can see the virtfns, though.
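For reference, they show up under the PF's sysfs node (bus ID taken from the nvidia-smi output above), even though /sys/class/mdev_bus never appears, e.g.:
ls -l /sys/bus/pci/devices/0000:41:00.0/virtfn*
ls /sys/class/mdev_bus/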
I got it to work by doing a fresh install of Ubuntu.