Hello bros, please help me.
I'm running RHEL 8.6 with an A5000 GPU. I ran:
nvidia-smi
lsmod | grep vfio
But there is no mdev_bus directory under /sys/class.
Please help me, and God bless you!
Do you have SR-IOV enabled?
You need to run /usr/lib/nvidia/sriov-manage -e ALL
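Once the VFs are enabled, the directory should appear. A quick sketch to confirm (assuming the standard mdev sysfs layout; run it after sriov-manage):

```shell
# Hedged check: after sriov-manage has enabled the virtual functions,
# /sys/class/mdev_bus should exist with one entry per VF.
if [ -d /sys/class/mdev_bus ]; then
    status="present: $(ls /sys/class/mdev_bus | wc -l) device(s)"
else
    status="not present (VFs not enabled yet)"
fi
echo "mdev_bus $status"
```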
I have the same problem on RHEL 9.
SR-IOV is enabled and /sriov-manage -e ALL has been run.
I don't know if this is the cause, but I compared RHEL 9 with AlmaLinux 8.6 (where I ran the previous vGPU version) and found the following:
RHEL 9: lsmod | grep vfio
nvidia_vgpu_vfio 65536 0
mdev 32768 1 nvidia_vgpu_vfio
vfio_iommu_type1 45056 0
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
Alma 8.6: lsmod | grep vfio
nvidia_vgpu_vfio 27099 0
nvidia 12316924 1 nvidia_vgpu_vfio
vfio_mdev 12841 0
mdev 20414 2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1 22342 0
vfio 32331 3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
If you compare the two, vfio_mdev is missing on RHEL 9.
We are talking about different OS versions, though (I haven't run Alma 9)…
Any ideas?
Thanks.
I'm having the same issue on Oracle Linux 9 (kernel 5.14.0-70.30.1.0.1.el9_0.x86_64) with an A5000.
SR-IOV is enabled in the BIOS and I ran /usr/lib/nvidia/sriov-manage -e ALL
lsmod | grep vfio
nvidia_vgpu_vfio 65536 0
mdev 32768 1 nvidia_vgpu_vfio
vfio_iommu_type1 45056 0
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
Any ideas?
I found the solution: the issue is that the A5000 has its physical DisplayPorts enabled.
“Some supported NVIDIA GPUs don’t have vGPU enabled out of the box and need to have their display ports disabled. This is the case with our RTX A5000, and can be achieved by using their display mode selector tool”
./displaymodeselector --gpumode
It's an interactive prompt: select “physical_display_disabled” and then choose which GPUs to apply it to. After that, running /usr/lib/nvidia/sriov-manage -e ALL should produce some output; then reboot. I created a crontab entry so that this is executed on every reboot, like so:
@reboot root /usr/lib/nvidia/sriov-manage -e ALL
But you can also create a systemd unit to do this.
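For the systemd route, a minimal oneshot unit could look like this (a sketch; the unit name and the After= service names are assumptions — the vGPU host driver typically installs nvidia-vgpud/nvidia-vgpu-mgr services, but check what your release ships):

```ini
# /etc/systemd/system/nvidia-sriov.service (hypothetical name)
[Unit]
Description=Enable SR-IOV virtual functions on NVIDIA GPUs
# Run after the vGPU manager services; adjust names to your driver release.
After=nvidia-vgpud.service nvidia-vgpu-mgr.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target
```

Then enable it with systemctl enable nvidia-sriov.service so it runs at boot.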
Here are some more details regarding this issue:
https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x
Greetings :)
Hi,
just to add, this is also documented here: https://forums.developer.nvidia.com/uploads/short-url/wgqrloFXITvrWtGMI0QAMQVaWyD.pdf
These are workstation GPUs and therefore not enabled by default for virtualization.
regards
Simon
I was able to program the card for virtualization. Now I have a weird issue: I created 4 mdevs of 6 GB each, and some of them misbehave.
The devices in the screenshot shown as (rev a1) work, but the ones shown as (rev ff) don't.
On the (rev a1) devices, lspci looks good:
lspci -v -s 41:00.0
41:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
Subsystem: NVIDIA Corporation Device 147e
Flags: bus master, fast devsel, latency 0, IRQ 268, IOMMU group 45
Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
Memory at 26800000000 (64-bit, prefetchable) [size=32G]
Memory at 27c30000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] Null
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia
But on the (rev ff) devices I get the following error:
lspci -vv -s 41:01.0
41:01.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia
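One way to see at a glance which functions have fallen off the bus is to read the revision byte straight from sysfs (a sketch; 0000:41:00 is this poster's PCI address — adjust it for your system):

```shell
# Hedged sketch: print the PCI revision of each function of the GPU.
# A value of 0xff means the function is not responding on the bus.
found=0
for dev in /sys/bus/pci/devices/0000:41:00.*; do
    [ -e "$dev/revision" ] || continue
    found=1
    printf '%s rev=%s\n' "${dev##*/}" "$(cat "$dev/revision")"
done
[ "$found" -eq 1 ] || echo "no functions found at 0000:41:00.*"
```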
I did a bit of research, and it apparently can be a power or thermal issue, but the card is idle, not doing much, and plugged into a chassis in a data center.
Hi folks,
I am having the same issue on an “NVIDIA A100 80GB PCIe”: the mdev_bus directory hasn't been created.
SR-IOV and IOMMU are enabled as well.
sudo /usr/lib/nvidia/sriov-manage -e ALL
Enabling VFs on 0000:01:00.0
Cannot obtain unbindLock for 0000:01:00.0
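The unbindLock error typically means something still holds the GPU (the nvidia-smi output below shows Persistence Mode enabled, so nvidia-persistenced is one candidate, but this is an assumption about your setup). A hedged check for processes holding the device nodes:

```shell
# Hedged check: list any processes holding the NVIDIA device nodes, which
# would prevent sriov-manage from obtaining the unbind lock.
if ls /dev/nvidia* >/dev/null 2>&1; then
    msg=$(fuser -v /dev/nvidia* 2>&1 || true)
else
    msg="no /dev/nvidia* device nodes found"
fi
echo "${msg:-no process currently holds /dev/nvidia*}"
```

If something does show up (often nvidia-persistenced), stopping it before re-running sriov-manage may be worth a try.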
lsmod | grep vfio
nvidia_vgpu_vfio 53248 0
vfio_mdev 16384 0
mdev 24576 2 vfio_mdev,nvidia_vgpu_vfio
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Mon Jan 9 17:57:21 2023
Driver Version : 510.108.03
CUDA Version : Not Found
Attached GPUs : 4
GPU 00000000:01:00.0
Product Name : NVIDIA A100 80GB PCIe
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : Enabled
Pending : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Could the problem be due to Display Mode being enabled?
If so, how can I disable it, given that the A100 PCIe is not in the list of GPUs supported by displaymodeselector?
Thank you
Hi everyone!
If you have a problem with (rev ff), you may also need to enable ACS and ARI in your BIOS. In my case (ASUS PRIME X570-P, BIOS 4408):
Advanced > AMD CBS > NBIO Common Options > ACS Enable
Advanced > AMD CBS > NBIO Common Options > PCIe ARI Support
I'm hitting the same problem. Is there any solution?
I hit the same problem; after enabling SR-IOV, it works.
But must the A100 have SR-IOV enabled to use GRID vGPU?
Hi,
I also hit the same problem on an “NVIDIA A100 80GB PCIe”.
I am having trouble using vGPU on Ubuntu KVM.
Is there any solution?