/sys/class/mdev_bus/ not found

Hello, please help me.
I'm on RHEL 8.6 with an A5000 GPU.

nvidia-smi

lsmod | grep vfio

But there is no mdev_bus directory under /sys/class.

Please help me and God bless you!

Do you have SR-IOV enabled?
You need to run /usr/lib/nvidia/sriov-manage -e ALL
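After running it, you can sanity-check sysfs to see whether the VFs and the mdev_bus class actually appeared. A minimal sketch; the PCI address is only an example, and the sysroot parameter exists solely so the logic can be tried against a fake tree:

```shell
#!/bin/sh
# Sketch: check whether mdev_bus and SR-IOV VFs showed up in sysfs
# after sriov-manage. Pass "" as sysroot on a real host.
check_vgpu_sysfs() {
    sysroot="$1"   # "" on a real host; a fake tree for testing
    gpu="$2"       # PCI address of the GPU, e.g. 0000:41:00.0 (example)

    if [ -d "$sysroot/sys/class/mdev_bus" ]; then
        echo "mdev_bus: present"
    else
        echo "mdev_bus: missing"
    fi

    numvfs="$sysroot/sys/bus/pci/devices/$gpu/sriov_numvfs"
    if [ -r "$numvfs" ]; then
        echo "sriov_numvfs: $(cat "$numvfs")"
    else
        echo "sriov_numvfs: not available (SR-IOV off or wrong address?)"
    fi
}

# On a real host: check_vgpu_sysfs "" "0000:41:00.0"
```

If sriov_numvfs reads back as 0 or is missing, sriov-manage did not take effect.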

I have the same problem on RHEL 9.
SR-IOV is enabled, and I ran /usr/lib/nvidia/sriov-manage -e ALL.

I don't know if this is the reason, but I compared RHEL 9 with AlmaLinux 8.6 (where I ran the previous vGPU version) and found the following:

RHEL 9: lsmod | grep vfio
nvidia_vgpu_vfio 65536 0
mdev 32768 1 nvidia_vgpu_vfio
vfio_iommu_type1 45056 0
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev

AlmaLinux 8.6: lsmod | grep vfio
nvidia_vgpu_vfio 27099 0
nvidia 12316924 1 nvidia_vgpu_vfio
vfio_mdev 12841 0
mdev 20414 2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1 22342 0
vfio 32331 3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1

Comparing the two, vfio_mdev is missing on RHEL 9.
We are talking about different OS versions, though (I haven't run Alma 9)…

Any ideas?
Thanks.

I'm having the same issue on Oracle Linux 9, kernel 5.14.0-70.30.1.0.1.el9_0.x86_64, with an A5000.

SR-IOV is enabled in the BIOS and I ran /usr/lib/nvidia/sriov-manage -e ALL

lsmod | grep vfio

nvidia_vgpu_vfio 65536 0
mdev 32768 1 nvidia_vgpu_vfio
vfio_iommu_type1 45056 0
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev

Any ideas?

I found the solution: the issue is that the A5000 has its physical display ports enabled.

“Some supported NVIDIA GPUs don’t have vGPU enabled out of the box and need to have their display ports disabled. This is the case with our RTX A5000, and can be achieved by using their display mode selector tool”

./displaymodeselector --gpumode

It's an interactive prompt: select "physical_display_disabled" and then choose which GPUs to apply it to. After that, executing /usr/lib/nvidia/sriov-manage -e ALL should produce some output; then reboot. I created a crontab entry so that this runs on every boot, like so:

@reboot root /usr/lib/nvidia/sriov-manage -e ALL

But you can also create a systemd unit to do this.
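A matching systemd unit might look like the sketch below. The unit file name and the After= targets are assumptions (recent vGPU host drivers ship nvidia-vgpud and nvidia-vgpu-mgr services), so adjust them to your install:

```ini
# /etc/systemd/system/nvidia-sriov.service  (file name is my own choice)
[Unit]
Description=Enable SR-IOV VFs for NVIDIA vGPU
# Run after the vGPU host services; adjust if your driver names them differently.
After=nvidia-vgpud.service nvidia-vgpu-mgr.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target
```

Then run systemctl daemon-reload and systemctl enable nvidia-sriov.service.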

Here are some more details on this issue:

https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x

Greetings :)

Hi,

just to add, this is also documented here: https://forums.developer.nvidia.com/uploads/short-url/wgqrloFXITvrWtGMI0QAMQVaWyD.pdf

These are workstation GPUs and therefore not enabled by default for virtualization.

regards
Simon

I was able to program the card for virtualization, but now I have a strange issue: I created 4 mdevs of 6 GB each, and some of them misbehave.
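For anyone following along, the usual way to create those mdev instances is through sysfs on one of the VFs. A sketch; the VF address and the profile name nvidia-661 are placeholders (list mdev_supported_types under your VF and read each profile's name file to find the 6 GB one):

```shell
#!/bin/sh
# Sketch: create one vGPU mdev instance through sysfs.
# The VF address and profile name below are placeholders, not
# values from this thread; pass "" as sysroot on a real host.
create_mdev() {
    sysroot="$1"      # "" on a real host; a fake tree for testing
    vf="$2"           # a VF address, e.g. 0000:41:00.4 (example)
    mdev_type="$3"    # a profile name, e.g. nvidia-661 (placeholder)

    uuid=$(cat /proc/sys/kernel/random/uuid)
    # Writing a UUID to the profile's "create" node asks the kernel
    # to instantiate the mediated device.
    echo "$uuid" > "$sysroot/sys/class/mdev_bus/$vf/mdev_supported_types/$mdev_type/create"
    echo "$uuid"
}

# On a real host: create_mdev "" 0000:41:00.4 nvidia-661
```

The resulting UUID is what you hand to QEMU/libvirt as the mdev device.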

The devices shown in the screenshot as (rev a1) work, but the ones shown as (rev ff) don't.

On a (rev a1) device, lspci looks good:

lspci -v -s 41:00.0
41:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
Subsystem: NVIDIA Corporation Device 147e
Flags: bus master, fast devsel, latency 0, IRQ 268, IOMMU group 45
Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
Memory at 26800000000 (64-bit, prefetchable) [size=32G]
Memory at 27c30000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] Null
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia

But on a (rev ff) device I get the following error:

lspci -vv -s 41:01.0
41:01.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia

I did a bit of research, and apparently this points to a power or thermal issue, but the card is idle, not doing much, mounted in a chassis in a data center.

Hi folks,
I am having the same issue on “NVIDIA A100 80GB PCIe”. The mdev_bus directory hasn’t been created.
SR-IOV and IOMMU are enabled as well.

 sudo /usr/lib/nvidia/sriov-manage -e ALL
Enabling VFs on 0000:01:00.0
Cannot obtain unbindLock for 0000:01:00.0
 lsmod | grep vfio
nvidia_vgpu_vfio       53248  0
vfio_mdev              16384  0
mdev                   24576  2 vfio_mdev,nvidia_vgpu_vfio
 nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Mon Jan  9 17:57:21 2023
Driver Version                            : 510.108.03
CUDA Version                              : Not Found

Attached GPUs                             : 4
GPU 00000000:01:00.0
    Product Name                          : NVIDIA A100 80GB PCIe
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : Enabled
        Pending                           : Enabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A

Could the problem be caused by Display Mode being enabled?
If so, how can I disable it, given that the A100 PCIe is not in the list of GPUs supported by displaymodeselector?

Thank you

Hi everyone!

If you have a problem with (rev ff), you also need to enable ACS and ARI in your BIOS. In my case (ASUS PRIME X570-P, BIOS 4408):
Advanced > AMD CBS > NBIO Common Options > ACS Enable
Advanced > AMD CBS > NBIO Common Options > PCIe ARI Support