NVIDIA A40 does not show mdev_supported_types and I can't create vGPU instances

I have installed the NVIDIA software on Linux release 8.3.2011 with kernel 5.4.107 on systems with T4 and V100 cards without problems, but when I install the NVIDIA software on a system with an A40 card I can't create vGPU instances.

I installed NVIDIA-GRID-Linux-KVM-460.32.04-460.32.03-461.33 without errors, but when I list /sys/bus/pci/devices/0000:41:00.0 there is no mdev_supported_types directory.

In /sys/bus/pci/devices/0000:41:00.0 there are iommu and iommu_group directories and sriov* files that don't appear in the other installations with T4 or V100.

Any ideas? Can you help me?
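For reference, whether any mediated device types are registered at all can be checked in the generic mdev sysfs locations (nothing NVIDIA-specific here):

[root@a40 ~]# ls /sys/class/mdev_bus/
[root@a40 ~]# ls /sys/bus/pci/devices/0000:41:00.0/ | grep mdev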

nvidia-smi output is:

[root@a40 ~]# nvidia-smi
Sun Mar 21 09:15:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.04    Driver Version: 460.32.04    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A40                 On   | 00000000:41:00.0 Off |                    0 |
|  0%   29C    P0    73W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Compute mode is selected:

[root@a40 ~]# ./displaymodeselector --listgpumodes

NVIDIA Display Mode Selector Utility (Version 1.48.0)
Copyright © 2015-2020, NVIDIA Corporation. All Rights Reserved.

Adapter: Graphics Device (10DE,2235,10DE,145A) S:00,B:41,D:00,F:00

EEPROM ID (EF,6015) : WBond W25Q16FW/JW 1.65-1.95V 16384Kx1S, page

GPU Mode: Compute

[root@a40]# ls /sys/bus/pci/devices/0000:41:00.0
aer_dev_correctable
aer_dev_fatal
aer_dev_nonfatal
ari_enabled
broken_parity_status
class
config
consistent_dma_mask_bits
current_link_speed
current_link_width
d3cold_allowed
device
dma_mask_bits
driver
driver_override
enable
i2c-5
i2c-6
iommu
iommu_group
irq
local_cpulist
local_cpus
max_link_speed
max_link_width
modalias
msi_bus
msi_irqs
numa_node
power
remove
rescan
reset
resource
resource0
resource1
resource1_wc
resource3
resource3_wc
revision
sriov_drivers_autoprobe
sriov_numvfs
sriov_offset
sriov_stride
sriov_totalvfs
sriov_vf_device
subsystem
subsystem_device
subsystem_vendor
uevent
vendor
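
Since the sriov_* files are present, the SR-IOV capability itself can be inspected directly through standard PCI sysfs (how many virtual functions the card supports and how many are currently enabled):

[root@a40 ~]# cat /sys/bus/pci/devices/0000:41:00.0/sriov_totalvfs
[root@a40 ~]# cat /sys/bus/pci/devices/0000:41:00.0/sriov_numvfs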

I had the same problem with the A6000. The manual doesn't point this out very clearly, but you need to do the following:

/usr/lib/nvidia/sriov-manage -e 00:D8:0000.0 (in your case it will be a different device ID).
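
To find the right address for your card you can use lspci with the -D flag, which includes the PCI domain (as far as I can tell, sriov-manage wants the full domain:bus:device.function form):

lspci -D | grep -i nvidia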

Once you do this it will create the mdevs. In my case it created ~20 devices with device IDs offset by +1 or so, meaning you can't use your original device ID but need to take one of the others.

I have to do this after each boot (I could make it permanent if I wanted to).
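
If you want it to survive reboots, one way is a small systemd unit that runs the command at boot. This is just a sketch: the unit name is made up, and the service ordering may need adjusting to match the vGPU host services your installation provides.

# /etc/systemd/system/nvidia-sriov.service (hypothetical unit name)
[Unit]
Description=Enable SR-IOV virtual functions for NVIDIA vGPU
# run after the vGPU manager service; adjust to your install
After=nvidia-vgpu-mgr.service

[Service]
Type=oneshot
# use your own PCI address here; newer releases also accept "ALL"
ExecStart=/usr/lib/nvidia/sriov-manage -e 00:D8:0000.0

[Install]
WantedBy=multi-user.target

Then enable it with systemctl enable nvidia-sriov.service.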

Hope this helps you.

Thanks Stefan,

your solution worked!

After running "sriov-manage -e", new directories appeared in /sys/bus/pci/devices/$bus:

root@a40# bus=$(nvidia-smi -q |grep ^GPU |awk -F " 0000" '{print tolower($2)}')
root@a40# /usr/lib/nvidia/sriov-manage -e $bus
root@a40 # ls /sys/bus/pci/devices/$bus/| grep ^virtfn |wc -l
32
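
Following Stefan's note about making this permanent, the two steps above can be wrapped in a small boot script with a guard so it does nothing if the virtual functions are already enabled (just a sketch):

#!/bin/bash
# detect the PCI address of the GPU and enable its virtual functions
bus=$(nvidia-smi -q | grep ^GPU | awk -F " 0000" '{print tolower($2)}')
if [ "$(cat /sys/bus/pci/devices/$bus/sriov_numvfs)" -eq 0 ]; then
    /usr/lib/nvidia/sriov-manage -e $bus
fi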

This layout is different from what NVIDIA describes in its documentation. There is no "mdev_supported_types" directory on the physical function itself; instead, 32 directories "virtfn0" to "virtfn31" appear.

Each virtfn* directory contains an mdev_supported_types directory that lists all the vGPU models available on this card.

For example:

root@a40# cat "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/name"
NVIDIA A40-1Q
root@a40# cat "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/available_instances" 
1
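
The same information for every type can be dumped with a quick loop over the sysfs entries (a sketch):

root@a40# for t in /sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/*; do echo "$(basename $t): $(cat $t/name)"; done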

If you create an mdev device instance:

root@a40# uid=$(uuidgen)
root@a40# echo $uid > "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/create"
root@a40# cat "/sys/bus/pci/devices/0000:41:00.0/virtfn0/mdev_supported_types/nvidia-557/available_instances"
0
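
(If the instance is no longer needed, it can be removed again through its own sysfs node, where $uid is the UUID used above:)

root@a40# echo 1 > /sys/bus/mdev/devices/$uid/remove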

And if you create more instances, for example:

root@a40# uid=$(uuidgen)
root@a40# echo $uid > "/sys/bus/pci/devices/0000:41:00.0/virtfn1/mdev_supported_types/nvidia-557/create"

If we have created 3 instances:

root@a40# ls /sys/bus/mdev/devices/ |wc -l 
3
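
Each created device also records which type it was created from, so a quick loop shows the mapping between UUIDs and vGPU types (a sketch):

root@a40# for d in /sys/bus/mdev/devices/*; do echo "$(basename $d) -> $(basename $(readlink $d/mdev_type))"; done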

And for a maximum of 32 instances of type NVIDIA A40-1Q, 29 instances remain available across all the directories:

root@a40# cat /sys/bus/pci/devices/0000\:41\:00.0/virtfn*/mdev_supported_types/nvidia-557/available_instances |grep 1 |wc -l 
29
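
To actually use one of these instances it has to be passed to a VM. With libvirt managing the KVM host (an assumption on my side), the created mdevs also show up as node devices, which makes them easy to find:

root@a40# virsh nodedev-list --cap mdev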

I hope this strange behaviour will be explained by NVIDIA or addressed in future releases of the vGPU software.