I’m trying to enable GPU virtualization with a Tesla T4 GPU on Rocky Linux. I have no experience with virtual GPUs, and with the changes in version 16 of the documentation I am a bit lost.
I have read that NVIDIA has two different Tesla T4 GPU card models with similar device and revision numbers:
one which is SR-IOV capable (device_type: "type-PF")
one which is NOT SR-IOV capable (device_type: "type-PCI")
After installing the Linux driver (535.129.03-537.70) and checking it, I verified that the VM host server supports VT-d/IOMMU and SR-IOV.
I disabled the nouveau kernel module, and after a reboot I can see the required vfio modules and their dependencies loaded.
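The IOMMU/vfio checks above can be scripted as a quick sanity test. This is a minimal sketch assuming only standard sysfs paths and module names, nothing NVIDIA-specific:

```shell
# IOMMU groups only appear under sysfs when VT-d/IOMMU is actually active:
iommu_dir=/sys/kernel/iommu_groups
if [ -d "$iommu_dir" ] && [ -n "$(ls -A "$iommu_dir" 2>/dev/null)" ]; then
    echo "IOMMU active"
else
    echo "IOMMU not active (check intel_iommu=on / amd_iommu=on and BIOS VT-d)"
fi

# vGPU needs the vfio stack plus NVIDIA's vGPU module; report each one:
for mod in vfio vfio_mdev nvidia_vgpu_vfio; do
    if lsmod 2>/dev/null | grep -q "^${mod}[[:space:]]"; then
        echo "$mod loaded"
    else
        echo "$mod missing"
    fi
done
```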
The new mdev supported types are added under the /sys/bus/pci/devices/.... directory, but I cannot enable the virtual functions as the documentation indicates here:
/usr/lib/nvidia/sriov-manage -e ...
The command returns no error, but the virtual functions do not appear in the directory and it is not possible to create a vGPU.
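When sriov-manage does succeed, the PF's sysfs directory grows `virtfnN` symlinks, so their absence can be checked directly. A quick sketch, assuming the T4 sits at 0000:4b:00.0 (taken from the nvidia-smi output; adjust to your host):

```shell
# Check whether the driver exposed any SR-IOV virtual functions on the PF:
pf=/sys/bus/pci/devices/0000:4b:00.0

if [ -e "$pf/sriov_numvfs" ]; then
    echo "VFs enabled: $(cat "$pf/sriov_numvfs") of $(cat "$pf/sriov_totalvfs")"
    ls -d "$pf"/virtfn* 2>/dev/null    # one symlink per virtual function
else
    echo "no sriov_numvfs node: the driver exposes no SR-IOV VFs on this device"
fi
```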
Is it possible to use vGPU with an NVIDIA Tesla T4? This works with an NVIDIA A100 (Ampere).
In the latest versions (16.2), do we have to use NVIDIA AI Enterprise?
Querying the PCI device, I can see that it supports SR-IOV and Alternative Routing-ID Interpretation (ARI), and that the kernel driver in use is NVIDIA's.
lspci -v -s 17:00.0
17:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
Subsystem: NVIDIA Corporation Device 145f
Physical Slot: 4
Flags: bus master, fast devsel, latency 0, IRQ 18, NUMA node 0, IOMMU group 29
Memory at d6000000 (32-bit, non-prefetchable) [size=16M]
Memory at 25000000000 (64-bit, prefetchable) [size=64G]
Memory at 27020000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] Null
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia
However, querying with the nvidia-smi -q command, I can see that the host vGPU mode is Non SR-IOV.
==============NVSMI LOG==============
Timestamp : Fri Jan 26 14:57:49 2024
Driver Version : 535.129.03
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported
Attached GPUs : 1
GPU 00000000:4B:00.0
Product Name : Tesla T4
Product Brand : NVIDIA
Product Architecture : Turing
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
....
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : Non SR-IOV
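Since nvidia-smi reports Host VGPU Mode: Non SR-IOV, this board follows the pre-Ampere flow: the vGPU mdev is created directly on the physical function, with no sriov-manage step. A hedged sketch of that flow — the PCI address matches the output above, but the nvidia-222 profile name is only a placeholder; list `mdev_supported_types/*/name` on your host to find real ones:

```shell
# Non-SR-IOV (Turing) flow: write a UUID into the chosen profile's create
# node under the PF itself. nvidia-222 is an assumed placeholder profile.
pf=/sys/bus/pci/devices/0000:4b:00.0
profile=nvidia-222

uuid=$(cat /proc/sys/kernel/random/uuid)
if [ -d "$pf/mdev_supported_types/$profile" ]; then
    echo "$uuid" > "$pf/mdev_supported_types/$profile/create"
    ls "/sys/bus/mdev/devices/$uuid"      # the newly created vGPU device
else
    echo "profile $profile not found under $pf/mdev_supported_types"
fi
```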
Hi, and what is the issue/question?
I don’t get the full picture. In the first output there is an A100, and in the second a T4.
The T4 doesn’t support SR-IOV, but it certainly supports vGPU.
Rocky Linux, on the other hand, is not supported, so you would be on your own to test whether it works or not.
Please see also our support matrix, where you can see which GPU is supported on which hypervisor: