NVIDIA A40 does not show mdev_supported_types and I can't create vGPU instances

I have installed the NVIDIA vGPU software on Linux release 8.3.2011 with kernel 5.4.107 on systems with T4 and V100 cards without problems, but when I install the software on a system with an A40 card I can't create vGPU instances.

I installed NVIDIA-GRID-Linux-KVM-460.32.04-460.32.03-461.33 without errors, but when I list /sys/bus/pci/devices/0000:41:00.0 there is no mdev_supported_types directory.

In /sys/bus/pci/devices/0000:41:00.0 there are iommu and iommu_group directories and sriov* files that don't appear in the T4 or V100 installations.

Any ideas? Can you help me?

nvidia-smi output is:

[root@a40 ~]# nvidia-smi
Sun Mar 21 09:15:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.04    Driver Version: 460.32.04    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A40                 On   | 00000000:41:00.0 Off |                    0 |
|  0%   29C    P0    73W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Compute mode is selected:

[root@a40 ~]# ./displaymodeselector --listgpumodes

NVIDIA Display Mode Selector Utility (Version 1.48.0)
Copyright (C) 2015-2020, NVIDIA Corporation. All Rights Reserved.

Adapter: Graphics Device (10DE,2235,10DE,145A) S:00,B:41,D:00,F:00

EEPROM ID (EF,6015) : WBond W25Q16FW/JW 1.65-1.95V 16384Kx1S, page

GPU Mode: Compute

[root@a40]# ls /sys/bus/pci/devices/0000:41:00.0
aer_dev_correctable
aer_dev_fatal
aer_dev_nonfatal
ari_enabled
broken_parity_status
class
config
consistent_dma_mask_bits
current_link_speed
current_link_width
d3cold_allowed
device
dma_mask_bits
driver
driver_override
enable
i2c-5
i2c-6
iommu
iommu_group
irq
local_cpulist
local_cpus
max_link_speed
max_link_width
modalias
msi_bus
msi_irqs
numa_node
power
remove
rescan
reset
resource
resource0
resource1
resource1_wc
resource3
resource3_wc
revision
sriov_drivers_autoprobe
sriov_numvfs
sriov_offset
sriov_stride
sriov_totalvfs
sriov_vf_device
subsystem
subsystem_device
subsystem_vendor
uevent
vendor

I had the same problem with the A6000 - the manual does not really point it out that well, but you need to do the following:

/usr/lib/nvidia/sriov-manage -e 0000:D8:00.0 - in your case it will be a different device ID.

Once you have done this it will create mdevs - in my case it created ~20 devices with device ID +1 or so - meaning you can't use your original device ID but need to take one of the others.

I have to do this after each boot (it could be made permanent if I wanted to; a sketch of one way follows below).

Hope this helps you.
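
In case it helps, here is a minimal sketch of making this permanent with a systemd oneshot unit. The unit name, the "-e ALL" argument (enable VFs on every vGPU-capable GPU) and the After= ordering are assumptions you may need to adapt to your release:

# Sketch: run sriov-manage automatically at boot via a oneshot systemd unit.
# Replace "-e ALL" with your own domain:bus:slot.function if you prefer, and
# adjust or drop the After= line if the vGPU manager service is named
# differently on your system.
cat > /etc/systemd/system/nvidia-sriov.service <<'EOF'
[Unit]
Description=Enable NVIDIA SR-IOV virtual functions for vGPU
After=nvidia-vgpu-mgr.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now nvidia-sriov.service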

Thanks Stefan,

Your solution worked!

After running "sriov-manage -e", new directories appeared in /sys/bus/pci/devices/$bus:

root@a40# bus=$(nvidia-smi -q |grep ^GPU |awk -F " 0000" '{print tolower($2)}')
root@a40# /usr/lib/nvidia/sriov-manage -e $bus
root@a40 # ls /sys/bus/pci/devices/$bus/| grep ^virtfn |wc -l
32
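
For what it's worth, the standard SR-IOV sysfs attributes on the physical function (the sriov_* files from the original listing) also show what sriov-manage did:

# sriov_totalvfs is the maximum number of VFs the card supports (32 here),
# sriov_numvfs is how many are currently enabled by sriov-manage.
cat /sys/bus/pci/devices/$bus/sriov_totalvfs
cat /sys/bus/pci/devices/$bus/sriov_numvfs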

This layout is different from what NVIDIA describes in its documentation. There is no "mdev_supported_types" directory on the physical function; instead there are 32 directories, "virtfn0" to "virtfn31".

In each virtfn* directory there is a mdev_supported_types directory that contains all the vGPU types available on this card.

For example:

root@a40# cat "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/name"
NVIDIA A40-1Q
root@a40# cat "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/available_instances" 
1
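
To get an overview of every vGPU type a virtual function exposes, you can loop over mdev_supported_types (a small sketch using the same sysfs layout as above):

# Print the type ID, the human-readable name and the remaining capacity for
# each vGPU type exposed by virtfn0.
for t in /sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/*; do
    echo "$(basename "$t")  $(cat "$t/name")  available=$(cat "$t/available_instances")"
done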

If you create an mdev device instance:

root@a40# uid=$(uuidgen)
root@a40# echo $uid > "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/create"
root@a40# cat "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/available_instances"
0
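
(Conversely, an instance can be removed again through the standard mdev sysfs interface, after which available_instances goes back to 1:)

# Remove the mdev created above, using the same uuid.
echo 1 > "/sys/bus/mdev/devices/$uid/remove"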

And if you create more instances, for example:

root@a40# uid=$(uuidgen)
root@a40# echo $uid > "/sys/bus/pci/devices/0000:41:00.0/virtfn1/mdev_supported_types/nvidia-557/create"

If we have created 3 instances:

root@a40# ls /sys/bus/mdev/devices/ |wc -l 
3

And for a maximum of 32 instances of type NVIDIA A40-1Q we now have 29 instances available across all the directories:

root@a40# cat /sys/bus/pci/devices/0000\:41\:00.0/virtfn*/mdev_supported_types/nvidia-557/available_instances |grep 1 |wc -l 
29
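
Since each virtual function only exposes a single instance of a given type, a small loop is the easiest way to create several of them. A sketch for the first four VFs with the nvidia-557 (A40-1Q) type:

# Create one A40-1Q mdev on each of virtfn0..virtfn3, then count them.
for i in $(seq 0 3); do
    uid=$(uuidgen)
    echo "$uid" > "/sys/bus/pci/devices/0000:41:00.0/virtfn$i/mdev_supported_types/nvidia-557/create"
done
ls /sys/bus/mdev/devices/ | wc -l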

I hope that this strange behaviour can be explained by NVIDIA or addressed in future releases of the vGPU software.


Can you help me? I have the same problem with an A5000; the corresponding directory does not exist:

ls /sys/bus/pci/drivers/nvidia/0000\:41\:00.0/mdev_supported_types

Even after running the command:

/usr/lib/nvidia/sriov-manage -e 0000:41:00.0

The /sys/bus/pci/drivers/nvidia/0000:41:00.0 folder exists, but I'm guessing this is not the A5000 board ID. What can be done about it?

Thanks.

Did you already use the mode selector tool? Otherwise the A5000 won't work! Make sure you know what you are doing and have a second GPU that can serve as the primary display device, or your machine won't be able to boot afterwards.

Hi sschaber, how are you?

I'm almost a year late on this question, but I'm still stuck on the same issue. We now have two RTX 6000 Ada cards for which I need to enable vGPU support. As mentioned before, running /usr/lib/nvidia/sriov-manage -e domain:bus:slot.function does not generate the directory either, so the solution is to use the "mode selector tool" application, correct?

Can you help me perform the procedure correctly to ensure the integrity of the device, or can you direct me on how we can get support to perform this procedure?

Correct. The Mode Selector Tool is required to change into DC mode with full BAR1 size.
