Enable virtual GPUs on NVIDIA Tesla T4

Hello,

I’m trying to enable GPU virtualization using a Tesla T4 GPU on Rocky Linux. I have no experience with virtual GPUs, and with the changes in version 16 of the documentation I am a bit lost.

I have read that NVIDIA has two different Tesla T4 GPU card models with similar device and revision numbers:

  • one which is SR-IOV capable (device_type: “type-PF”)
  • one which is NOT SR-IOV capable (device_type: “type-PCI”)

After installing the Linux driver (535.129.03-537.70) and checking it, I verified that the VM host server supports VT-d/IOMMU and SR-IOV.
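
For reference, this is roughly how I checked it (a minimal sketch; the exact boot messages vary by platform, Intel reports DMAR and AMD reports AMD-Vi):

# Confirm the IOMMU was initialized at boot
dmesg | grep -i -e DMAR -e AMD-Vi -e IOMMU
# Confirm the kernel created IOMMU groups
ls /sys/kernel/iommu_groups/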

I disabled the nouveau kernel module, and after a reboot I can see that the modules containing the vfio string (the required dependencies) are loaded.
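
In case it helps, these are the steps I used (a sketch; the blacklist file name is just the one I chose):

# /etc/modprobe.d/blacklist-nouveau.conf contains:
#   blacklist nouveau
#   options nouveau modeset=0
# Rebuild the initramfs and reboot, then check the vfio modules:
dracut --force
lsmod | grep vfio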

The new mdev supported types appear under the /sys/bus/pci/devices/.... directory, but I cannot enable the virtual functions as the documentation indicates here:

/usr/lib/nvidia/sriov-manage -e ...

The command reports no error, but the virtual functions do not appear in the directory, and it is not possible to create a vGPU.
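
This is how I am checking for the virtual functions (a sketch; 0000:4b:00.0 is my T4’s PCI address, adjust to yours):

# After sriov-manage, the VFs should appear as virtfn* links
ls -l /sys/bus/pci/devices/0000:4b:00.0/ | grep virtfn
# and the enabled VF count should be non-zero
cat /sys/bus/pci/devices/0000:4b:00.0/sriov_numvfs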

Is it possible to use vGPU with an NVIDIA Tesla T4? It works with an NVIDIA A100 (Ampere).
In the latest versions (16.2), do we have to use NVIDIA AI Enterprise?

Thanks in advance for your help

Hi,
Any thoughts on this topic?

Querying the PCI device, I can see that it supports SR-IOV and Alternative Routing-ID Interpretation (ARI), and that the driver in use is NVIDIA’s.

lspci -v -s 17:00.0
17:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
	Subsystem: NVIDIA Corporation Device 145f
	Physical Slot: 4
	Flags: bus master, fast devsel, latency 0, IRQ 18, NUMA node 0, IOMMU group 29
	Memory at d6000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 25000000000 (64-bit, prefetchable) [size=64G]
	Memory at 27020000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] Null
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00] Lane Margining at the Receiver <?>
	Capabilities: [e00] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia

However, querying with the nvidia-smi -q command, I can see that the host vGPU mode is Non SR-IOV.

==============NVSMI LOG==============

Timestamp                                 : Fri Jan 26 14:57:49 2024
Driver Version                            : 535.129.03
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:4B:00.0
    Product Name                          : Tesla T4
    Product Brand                         : NVIDIA
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : N/A
    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
....
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : Non SR-IOV
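
For a quick check, the relevant section can be filtered directly (a sketch):

nvidia-smi -q | grep -A 2 'GPU Virtualization Mode'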

Hi, and what is the issue/question?
I don’t get the full picture. In the first listing there is an A100 and in the second a T4.
The T4 doesn’t support SR-IOV, but it certainly supports vGPU.
Rocky Linux, on the other hand, is not supported, so you would be on your own to find out whether it works.

Please also see our support matrix, which shows which GPU is supported on which hypervisor.

Regards
Simon

Sorry, it was a mistake.

The correct output for Tesla T4 is:

lspci -v -s 4b:00.0
4b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
	Subsystem: NVIDIA Corporation Device 12a2
	Physical Slot: 1
	Flags: bus master, fast devsel, latency 0, IRQ 18, NUMA node 1, IOMMU group 40
	Memory at de000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 23fc0000000 (64-bit, prefetchable) [size=256M]
	Memory at 23ff0000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] Null
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia

I can see the capability “[bcc] Single Root I/O Virtualization (SR-IOV)” in the output.
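
The SR-IOV details (Initial VFs, Total VFs, etc.) can be expanded like this (a sketch):

lspci -vv -s 4b:00.0 | grep -A 8 'SR-IOV'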

I read that NVIDIA has two different Tesla T4 GPU card models with similar device and revision numbers (see “Error when configuring PCI Pass-through for NVIDIA Tesla T4 GPU in OpenStack” on the Red Hat Customer Portal):

  • one which is SR-IOV capable (device_type: “type-PF”)
  • one which is NOT SR-IOV capable (device_type: “type-PCI”)

I will try to configure it without SR-IOV support, following the documentation.
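
For anyone following along, the legacy (non-SR-IOV) flow I plan to try looks roughly like this (a sketch based on the vGPU documentation; the profile name nvidia-222 is only an example, the real ones are listed under mdev_supported_types):

# List the vGPU profiles the card exposes (legacy mdev flow, no VFs involved)
ls /sys/bus/pci/devices/0000:4b:00.0/mdev_supported_types/
# Create a vGPU instance by writing a UUID into the chosen profile's create node
UUID=$(uuidgen)
echo "$UUID" > /sys/bus/pci/devices/0000:4b:00.0/mdev_supported_types/nvidia-222/create
# The new mediated device should now appear here
ls /sys/bus/mdev/devices/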

Thanks