Enable virtual GPUs on NVIDIA Tesla T4

Hello,

I’m trying to enable GPU virtualization using a Tesla T4 GPU on Rocky Linux. I have no experience with virtual GPUs, and with the changes in version 16 of the documentation I am a bit lost.

I have read that NVIDIA has two different Tesla T4 GPU card models with similar device and revision numbers:

  • one which is SR-IOV capable (device_type: “type-PF”)
  • one which is NOT SR-IOV capable (device_type: “type-PCI”)

After installing the Linux driver (535.129.03-537.70) and checking it, I verified that the VM host server supports VT-d/IOMMU and SR-IOV.
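
For reference, this is roughly how I checked it (a minimal sketch; the exact boot messages vary by platform, Intel reports DMAR and AMD reports AMD-Vi):

# Confirm the IOMMU was initialized at boot
dmesg | grep -i -e DMAR -e AMD-Vi -e IOMMU
# Confirm the kernel created IOMMU groups
ls /sys/kernel/iommu_groups/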

I disabled the nouveau kernel module, and after a reboot I can see that the modules containing the vfio string (the required dependencies) are loaded.
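
In case it helps, these are the steps I used (a sketch; the blacklist file name is just the one I chose):

# /etc/modprobe.d/blacklist-nouveau.conf contains:
#   blacklist nouveau
#   options nouveau modeset=0
# Rebuild the initramfs and reboot, then check the vfio modules:
dracut --force
lsmod | grep vfio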

The new mdev supported types appear under the /sys/bus/pci/devices/.... directory, but I cannot enable the virtual functions as the documentation indicates here:

/usr/lib/nvidia/sriov-manage -e ...

The command reports no error, but the virtual functions do not appear in the directory, and it is not possible to create a vGPU.
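
This is how I am checking for the virtual functions (a sketch; 0000:4b:00.0 is my T4’s PCI address, adjust to yours):

# After sriov-manage, the VFs should appear as virtfn* links
ls -l /sys/bus/pci/devices/0000:4b:00.0/ | grep virtfn
# and the enabled VF count should be non-zero
cat /sys/bus/pci/devices/0000:4b:00.0/sriov_numvfs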

Is it possible to use vGPU with an NVIDIA Tesla T4? It works with an NVIDIA A100 (Ampere).
In the latest versions (16.2), do we have to use NVIDIA AI Enterprise?

Thanks in advance for your help

Hi,
Any thoughts on this topic?

Querying the PCI device, I can see that it supports SR-IOV and Alternative Routing-ID Interpretation (ARI), and that the driver in use is NVIDIA’s.

lspci -v -s 17:00.0
17:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
	Subsystem: NVIDIA Corporation Device 145f
	Physical Slot: 4
	Flags: bus master, fast devsel, latency 0, IRQ 18, NUMA node 0, IOMMU group 29
	Memory at d6000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 25000000000 (64-bit, prefetchable) [size=64G]
	Memory at 27020000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] Null
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00] Lane Margining at the Receiver <?>
	Capabilities: [e00] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia

However, querying with the nvidia-smi -q command, I can see that the host vGPU mode is Non SR-IOV.

==============NVSMI LOG==============

Timestamp                                 : Fri Jan 26 14:57:49 2024
Driver Version                            : 535.129.03
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:4B:00.0
    Product Name                          : Tesla T4
    Product Brand                         : NVIDIA
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : N/A
    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
....
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : Non SR-IOV
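
For a quick check, the relevant section can be filtered directly (a sketch):

nvidia-smi -q | grep -A 2 'GPU Virtualization Mode'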

Hi, and what is the issue/question?
I don’t get the full picture. In the first listing there is an A100 and in the second a T4.
The T4 doesn’t support SR-IOV, but it certainly supports vGPU.
Rocky Linux, on the other hand, is not supported, so you would be on your own to find out whether it works.

Please also see our support matrix, which shows which GPU is supported on which hypervisor.

Regards
Simon

Sorry, it was a mistake.

The correct output for Tesla T4 is:

lspci -v -s 4b:00.0
4b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
	Subsystem: NVIDIA Corporation Device 12a2
	Physical Slot: 1
	Flags: bus master, fast devsel, latency 0, IRQ 18, NUMA node 1, IOMMU group 40
	Memory at de000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 23fc0000000 (64-bit, prefetchable) [size=256M]
	Memory at 23ff0000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] Null
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia

I can see the capability “[bcc] Single Root I/O Virtualization (SR-IOV)” in the output.
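
The SR-IOV details (Initial VFs, Total VFs, etc.) can be expanded like this (a sketch):

lspci -vv -s 4b:00.0 | grep -A 8 'SR-IOV'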

I read that NVIDIA has two different Tesla T4 GPU card models with similar device and revision numbers (see “Error when configuring PCI Pass-through for NVIDIA Tesla T4 GPU in OpenStack” on the Red Hat Customer Portal):

  • one which is SR-IOV capable (device_type: “type-PF”)
  • one which is NOT SR-IOV capable (device_type: “type-PCI”)

I will try to configure it without SR-IOV support, following the documentation.
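
For anyone following along, the legacy (non-SR-IOV) flow I plan to try looks roughly like this (a sketch based on the vGPU documentation; the profile name nvidia-222 is only an example, the real ones are listed under mdev_supported_types):

# List the vGPU profiles the card exposes (legacy mdev flow, no VFs involved)
ls /sys/bus/pci/devices/0000:4b:00.0/mdev_supported_types/
# Create a vGPU instance by writing a UUID into the chosen profile's create node
UUID=$(uuidgen)
echo "$UUID" > /sys/bus/pci/devices/0000:4b:00.0/mdev_supported_types/nvidia-222/create
# The new mediated device should now appear here
ls /sys/bus/mdev/devices/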

Thanks