What software to use for our new single NVIDIA Tesla T4 card on a VMware ESXi 6.7 host

Hi there, I am installing and configuring an NVIDIA Tesla T4 card on our new CCTV infrastructure using VMware ESXi 6.7.
The problem is that neither the CCTV software vendor BriefCam nor Dell/VMware can tell me which NVIDIA software option/license to set up. The CCTV software is being installed on one virtual server (Server 2019), and it wants to see one GPU per process, as follows:

Review GPUs 0.3
Research GPUs 0.6
Respond GPUs 0.7

Can anyone say whether the best option would be the NVIDIA Virtual Compute Server software or NVIDIA GRID? Please keep in mind this is not for virtual desktops; it's for vGPU presentation to these BriefCam software processes.

Hi

The easiest option is to download a vGPU evaluation so you can try the different features for yourself. This is free for 90 days.

Regards

MG

Yeah, cheers, I got that; I'm waiting for account clearance to download. I'm a bit surprised this is still somewhat uncharted ground and no "experts" can advise a definitive course of action/software to use.

Hi

It's not that it's uncharted ground, it's that software has different requirements depending on its purpose, and as you have three different bits of software that undoubtedly do different things (especially in a modern CCTV system), they may all have different requirements. So the easiest thing to do is validate exactly what you need with a quick Eval.

Are you Training? Inferencing? Rendering? All of them?

If you're rendering or want accelerated video, then you may want Quadro vDWS (QvDWS). If you're only inferencing / training, then vCS will be fine. Budget for QvDWS (as that covers all license options), but if you can get away with vCS in the Eval then go for that.

Regards

MG

Hi MG, it's hard to get any specifics from the software vendor; here is the detail from their website.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
The BriefCam Processing Server requires an NVIDIA GPU to process video. The exact model and amount of GPU cards required will depend on the requirements of the resolution of video to be processed along with the number of hours of video to process per day.

For a system with a large video processing requirement, multiple GPUs can be installed in each processing server, as well as the ability to utilize multiple processing servers simultaneously.

Each GPU installed in the BriefCam server must be assigned to process video in a specific processing mode, either for on-demand processing (REVIEW and RESEARCH modules) or for real-time processing (RESPOND module), a single GPU cannot process video both real-time and on-demand.
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

We are using one virtual machine (Server 2019) sitting on one VMware host (a Dell VxRail V570) with one Tesla T4 card, trying to allocate an individual vGPU to each of the three processes.

I have created an account to download the trial software.
I downloaded NVIDIA-GRID-vSphere-6.7-430.99-432.44, and NVIDIA-ls-Windows-2020.05.0.28406365 (for the license server).
I have installed the following VIB, NVIDIA-VMware-430.99-1OEM.670.0.0.8169922.x86_64.vib, on the ESXi 6.7 host. After a successful install and a reboot, running the command nvidia-smi I don't see any vGPUs listed…!

Oh, the output from the nvidia-smi command is:

[root@vxesxi5:~] nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
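
In case it's useful to anyone else hitting this: as I understand it, the VIB being installed and the kernel module being loaded are two different things on ESXi, so a quick first check is:

[root@vxesxi5:~] vmkload_mod -l | grep nvidia

If that returns nothing, the nvidia module isn't loaded even though the VIB shows as installed.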

MG, I tried a reinstall of the driver; it made no change. I then listed the installed VIBs, so the driver is installed. See below.

[root@vxesxi5:~] esxcli software vib install -v /tmp/NVIDIA_bootbank_NVIDIA-VMware_ESXi_6.7_Host_Driver_430.99-1OEM.670.0.0.8169922.vib
Installation Result
Message: Host is not changed.
Reboot Required: false
VIBs Installed:
VIBs Removed:
VIBs Skipped: NVIDIA_bootbank_NVIDIA-VMware_ESXi_6.7_Host_Driver_430.99-1OEM.670.0.0.8169922
[root@vxesxi5:~] esxcli software vib list
Name                                Version                           Vendor  Acceptance Level  Install Date
----------------------------------  --------------------------------  ------  ----------------  ------------
bnxtnet                             214.0.230.0-1OEM.670.0.0.8169922  BCM     VMwareCertified   2020-06-10
bnxtroce                            214.0.187.0-1OEM.670.0.0.8169922  BCM     VMwareCertified   2020-06-10
dellptagent                         1.9.4-41                          DEL     VMwareAccepted    2020-06-10
dcism                               3.4.1.ESXi6-1818                  Dell    VMwareAccepted    2020-06-10
lpfc                                12.2.373.1-1OEM.670.0.0.8169922   EMU     VMwareCertified   2020-06-10
i40en                               1.8.6-1OEM.670.0.0.8169922        INT     VMwareCertified   2020-06-10
igbn                                1.4.10-1OEM.670.0.0.8169922       INT     VMwareCertified   2020-06-10
ixgben                              1.7.17-1OEM.670.0.0.8169922       INT     VMwareCertified   2020-06-10
nmlx5-core                          4.17.15.16-1OEM.670.0.0.8169922   MEL     VMwareCertified   2020-06-10
nmlx5-rdma                          4.17.15.16-1OEM.670.0.0.8169922   MEL     VMwareCertified   2020-06-10
NVIDIA-VMware_ESXi_6.7_Host_Driver  430.99-1OEM.670.0.0.8169922       NVIDIA  VMwareAccepted    2020-07-06
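
From what I can tell, the "VIBs Skipped" / "Host is not changed" result just means that exact driver version was already installed, so the reinstall was a no-op. If a forced clean reinstall were ever needed, I believe the sequence would be something like this (host in maintenance mode first; VIB name taken from the list above):

[root@vxesxi5:~] esxcli system maintenanceMode set --enable true
[root@vxesxi5:~] esxcli software vib remove -n NVIDIA-VMware_ESXi_6.7_Host_Driver
[root@vxesxi5:~] esxcli software vib install -v /tmp/NVIDIA_bootbank_NVIDIA-VMware_ESXi_6.7_Host_Driver_430.99-1OEM.670.0.0.8169922.vib
[root@vxesxi5:~] reboot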

Just FYI, I followed this process:
I disabled the passthrough setting on the ESXi host under PCI Devices.
In the vSphere Web Client, I changed the Host Graphics and Graphics Devices settings to "Shared Direct".
I restarted the ESXi host. That seems to have gotten the card to show up.

[root@vxesxi5:~] nvidia-smi
Tue Jul 7 00:54:46 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.99       Driver Version: 430.99       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:3B:00.0 Off |                    0 |
| N/A   42C    P8    17W /  70W |     75MiB / 15359MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@vxesxi5:~] vmkload_mod -l | grep nvidia
nvidia 13 18176
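
(Side note for anyone else: I believe the same Host Graphics setting can also be checked and changed from the ESXi shell, where what the Web Client calls "Shared Direct" appears as "SharedPassthru":

[root@vxesxi5:~] esxcli graphics host get
[root@vxesxi5:~] esxcli graphics host set --default-type SharedPassthru

I set it through the Web Client, so treat those commands as untested on my side.)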

I can now add "a" GRID vGPU; I chose the GRID_T4-16C profile.

When trying to add a second vGPU, I find I am unable to add more than one to the one VM, even though the documentation states I should be able to add up to four vGPUs to one VM. I just get the error "The maximum number of devices of this type has been reached". What am I doing wrong?
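
If it helps diagnose, the vGPU User Guide suggests the host can report which vGPU types are supported and which are currently creatable; I believe the commands are:

[root@vxesxi5:~] nvidia-smi vgpu -s
[root@vxesxi5:~] nvidia-smi vgpu -c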

You're funny, man. Add more GPUs. How do you want to add multiple T4s if you have only one physical GPU present?

Hi SSD

Good progress so far, nice work getting it up and running :-)

Yes, within vCenter you need to change to "Shared Direct" for vGPU to be enabled; the other setting is for passthrough, which you aren't using in this instance. This is a per-host setting, not per-cluster, so you'll need to set it on all VxRail nodes that have a GPU in them.

Regarding vGPU Profiles, the number in the Profile equates to Framebuffer. 1 = 1GB / 2 = 2GB / 4 = 4GB / 8 = 8GB / 16 = 16GB. Other GPUs can go higher as they have more Framebuffer (24 / 32 / 40 / 48).
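
On a 16GB T4, the framebuffer maths works out as:

Profile size   Max simultaneous VMs per T4
1GB            16
2GB            8
4GB            4
8GB            2
16GB           1

(Note that not every size is offered in every Profile series; the User Guide has the definitive per-GPU list.)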

With vGPU, you hard-allocate the framebuffer per VM, but share the GPU processing / encoding / decoding cycles between VMs. What this means is that if you want to run 4 VMs at the same time, the maximum Profile size you can use with a 16GB GPU is 4GB. In case you aren't aware, detailed documentation is available here: NVIDIA Virtual GPU (vGPU) Software Documentation. The vGPU Software User Guide lists all the available Profiles per GPU (as well as other information) and is worth reading through: Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation

What I would do in this instance is start off with a well-spec'd VM. Give it 8 vCPUs, 16GB (system) RAM and a 16Q vGPU Profile (Quadro, not Compute) at this stage, and run each of the applications independently (one VM at a time, as it's a 16GB Profile) to build up an understanding of how each application uses the hardware and what kind of resources it needs. Using the "Q" Profile means that you're enabling all of the GPU's features and performance, and as you'll only be testing one VM at a time, you can allocate all of the framebuffer.

Once the application is working, you are happy with performance, and you have tailored the resources (CPU / RAM / GPU framebuffer), you can then change the GPU's Profile to a "C" and run the same tests to see what the differences are. You can then decide which vGPU Profile (C or Q) is appropriate for each application.

Be aware though, I see that the website mentions the application can make use of multiple GPUs, which means it may be quite heavy, and you could end up requiring an entire T4 per VM purely for processing cycles. Your testing will give you the results, and you can then decide how to proceed.

To help you with resource monitoring, you can use a tool called "GPU Profiler": Releases · JeremyMain/GPUProfiler · GitHub

Once your testing is complete, you should have a complete Resource Profile for each of the Applications. Each Application may require different amounts of resources depending on what it’s doing, so don’t assume they will all be the same. This includes CPU, RAM and GPU. GPU Profiler can help with monitoring all of those at the same time and will create a nice graph you can save and refer to later on. However, if the Application is also Multi-Threaded on the CPU, then just use (Windows) Task Manager to see what each Thread is doing and whether you need to scale up or down on the CPU side.

Don’t forget to optimize Windows as well: VMware OS Optimization Tool | VMware Flings

Let me know how you get on …

(FYI - I’m UK based, so there’s a bit of a time difference between us ;-) )

Regards

MG

Wow, awesome information, thanks MG. I had worked out the vGPU profiles' framebuffer sizes, but did not know about Q = Quadro.

As budget allows, this card is going to be doing all the work on one VM with 128GB of memory and 4 vCPUs, so we are simply going to try giving each vGPU a 4GB profile (NVIDIA GRID vGPU grid_t4-4c) to begin with and tune from there. Thanks for the link to GPU Profiler etc.; it will come in handy.
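
For my own notes: I believe that when a profile is assigned through the Web Client it ends up in the VM's .vmx as a line like the following (parameter name per the vGPU docs; the device index may vary):

pciPassthru0.vgpu = "grid_t4-4c"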

We only have this one host with the T4 GPU card, so it's kind of on its own running one VM; any changes I make are to the host alone, not the cluster.

I hadn't seen your response to this post and had started another thread in here about the issues starting a VM with more than one vGPU; can you please have a read of it and tell me what you think?

So in the UK you're working now?? And I'm up late trying to get this thing working… after work, lol.

Regards,
SSD

Hi SSD

Responded to the other post as well :-)

There may be a bit of confusion …

As each application wants to see a GPU per process, I'm assuming that you would run multiple VMs and run each application on a dedicated VM (up to 4 VMs, with a 4Q or 4C Profile assigned to each)? This would share the T4's resources across 4 individual VMs simultaneously.

The alternative to that is to assign the entire T4 to a single VM using the 16Q / 16C Profile. However, you mention above that the application wants to see one GPU per process, so I'm not sure if that will work. Do these applications all need to run at the same time as each other?

Yes, working now … :-)

Regards

MG

I have installed the following VIB, NVIDIA-VMware-430.99-1OEM.670.0.0.8169922.x86_64.vib, on the ESXi 6.7 host. After a successful install and a reboot, running the command nvidia-smi I don't see any vGPUs listed..!

Hi

Make sure your Host Graphics Setting is still configured for "Shared Direct".

Regards

MG