vGPU of Tesla T4 not seen on ESX 6.7

Hi,

On an ESX 6.7 host, I installed this driver:
NVIDIA-VMware_ESXi_6.7_Host_Driver-440.53-1OEM.670.0.0.8169922.x86_64.vib

but I’m not able to get the NVIDIA GPU on my VMs,

and the output of the nvidia-smi vgpu command is:
[root@localhost:/vmfs] nvidia-smi vgpu
Not supported devices in vGPU mode

However, the nvidia-smi command says:
[root@localhost:/vmfs] nvidia-smi
Thu Feb 13 09:36:42 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.53       Driver Version: 440.53       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   37C    P8    17W /  70W |     92MiB / 15359MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2104612      G   Xorg                                           5MiB |
+-----------------------------------------------------------------------------+

Could it be a problem with the driver?
If that’s the case, is there another driver I have to use?

Thanks for your help!
John

Which host? Have you checked that the graphics setting in vCenter is Shared Direct?

The server is a Supermicro. Supermicro has already confirmed that it is compatible with the Tesla T4.
The VM runs Windows 10.

Thanks

"vCenter settings checked for shared direct?" I don’t understand. And I thought that vCenter doesn’t work with 6.7 (?)…

By the way, the result of the nvidia-smi -q command is:

==============NVSMI LOG==============

Timestamp : Wed Mar 11 16:21:30 2020
Driver Version : 440.53
CUDA Version : Not Found

Attached GPUs : 1
GPU 00000000:5E:00.0
Product Name : Tesla T4
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322419111424
GPU UUID : GPU-34d6d925-61d7-ca33-9c9b-34420d8614c9
Minor Number : 0
VBIOS Version : 90.04.38.00.03
MultiGPU Board : No
Board ID : 0x5e00
GPU Part Number : 900-2G183-0000-001
Inforom Version
Image Version : G183.0200.00.02
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : Host VSGA
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x5E
Device : 0x00
Domain : 0x0000
Device Id : 0x1EB810DE
Bus Id : 00000000:5E:00.0
Sub System Id : 0x12A210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15359 MiB
Used : 92 MiB
Free : 15267 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Temperature
GPU Current Temp : 47 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 85 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 18.36 W
Power Limit : 70.00 W
Default Power Limit : 70.00 W
Enforced Power Limit : 70.00 W
Min Power Limit : 60.00 W
Max Power Limit : 70.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Max Clocks
Graphics : 1590 MHz
SM : 1590 MHz
Memory : 5001 MHz
Video : 1470 MHz
Max Customer Boost Clocks
Graphics : 1590 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 2100289
Type : G
Name : Xorg
Used GPU Memory : 5 MiB

You need to change the GPU mode to "Shared Direct" in vCenter. Otherwise it won’t work. Please follow our documentation
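
For a host that is not managed through vCenter, the same default can, as far as I know, also be set from the ESXi shell. The lines below are only a sketch under that assumption: the function name set_shared_direct is made up for illustration, and SharedPassthru is the esxcli name for what the UI calls "Shared Direct" ("Shared" corresponds to vSGA).

```shell
# Sketch for the ESXi shell (assumes the esxcli graphics namespace of ESXi 6.5+).
# set_shared_direct is an illustrative helper name, not an NVIDIA or VMware tool.
set_shared_direct() {
    esxcli graphics host get                                # show the current default type
    esxcli graphics host set --default-type SharedPassthru  # switch to Shared Direct
    esxcli graphics host get                                # confirm the new setting
}

# On the host, run: set_shared_direct
# Then reboot the host (or restart the Xorg service) so the change takes effect.
```

After the reboot, nvidia-smi -q should no longer report "Host VSGA" as the virtualization mode if the change was picked up.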

OK, thanks, but I don’t know how to do that.

OK, I get it! (Sorry, a matter of RTFM…)

By the way, is Shared Direct different from the usual passthrough? What we want is a shared vGPU rather than simple passthrough!

All the Best

What we want to do is run several VMs (at least 8) on one server with the vGPU T4_2Q profile, using an NVIDIA GRID vDWS license.
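
A quick sanity check on that target, assuming the T4_2Q profile gives each VM 2048 MB of framebuffer and the T4 exposes 16384 MB in total (figures to confirm in NVIDIA's vGPU documentation, they are not measured on this host). Under those assumptions one card fits exactly 8 such vGPUs, so "at least 8" VMs is also the ceiling for a single T4. The helper name max_vgpus is only illustrative.

```shell
# vGPUs per board = board framebuffer (MB) / profile framebuffer (MB), integer division.
max_vgpus() {
    echo $(( $1 / $2 ))
}

max_vgpus 16384 2048   # T4 with a 2 GB (2Q) profile, prints 8
```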

Apparently the virtualization mode isn’t active!
How can I activate it?

Shared direct is the vGPU mode. You should now be able to add a vGPU profile to a VM. Be aware that you must not add Passthrough devices!!!

The point is that after doing that (disabling passthrough), when I create a VM I do not see any vGPU, or I don’t know how to add a vGPU profile!
In other words, when I create a new VM, what should I do to add a vGPU?

Hi,
just follow our documentation: Quick Start Guide :: NVIDIA Virtual GPU Software Documentation
You need to make sure that you have an Enterprise Plus license in place to add vGPU profiles.

Regards
Simon

The point is that I never see any Shared PCI Device option!
Do I have to activate Shared Direct? And if so, how? (I didn’t find a way to do it in the host options of the vSphere web UI…)

By the way, thanks Simon!

And the nvidia-smi vgpu on the hypervisor says:

[root@localhost:~] nvidia-smi vgpu
Not supported devices in vGPU mode

What is wrong?!

Please run nvidia-smi without the vgpu subcommand and post the output.

Here you are…

[root@localhost:~] nvidia-smi
Mon Mar 16 03:25:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.53       Driver Version: 440.53       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   38C    P8    17W /  70W |     92MiB / 15359MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2100302      G   Xorg                                           5MiB |
+-----------------------------------------------------------------------------+

One other point:
Due to the pandemic situation in France, we are having some difficulty communicating and getting information about this problem, which is starting to be urgent (!).

I should also note that I tried the 440.53 version of the driver and also the 430.83 (which is more recent than the 440, according to NVIDIA’s site), with exactly the same results!

The result of the nvidia-smi -a command is below (it says that the GPU virtualization mode is VSGA, which is not what we want!):

==============NVSMI LOG==============

Timestamp : Mon Mar 16 17:24:11 2020
Driver Version : 430.83
CUDA Version : Not Found

Attached GPUs : 1
GPU 00000000:5E:00.0
Product Name : Tesla T4
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322419111424
GPU UUID : GPU-34d6d925-61d7-ca33-9c9b-34420d8614c9
Minor Number : 0
VBIOS Version : 90.04.38.00.03
MultiGPU Board : No
Board ID : 0x5e00
GPU Part Number : 900-2G183-0000-001
Inforom Version
Image Version : G183.0200.00.02
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : Host VSGA
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x5E
Device : 0x00
Domain : 0x0000
Device Id : 0x1EB810DE
Bus Id : 00000000:5E:00.0
Sub System Id : 0x12A210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15359 MiB
Used : 92 MiB
Free : 15267 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Temperature
GPU Current Temp : 39 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 85 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 17.41 W
Power Limit : 70.00 W
Default Power Limit : 70.00 W
Enforced Power Limit : 70.00 W
Min Power Limit : 60.00 W
Max Power Limit : 70.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Max Clocks
Graphics : 1590 MHz
SM : 1590 MHz
Memory : 5001 MHz
Video : 1470 MHz
Max Customer Boost Clocks
Graphics : 1590 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 2100900
Type : G
Name : Xorg
Used GPU Memory : 5 MiB
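
The one line that matters in a dump like this is "Virtualization Mode". A small filter makes it easier to check after each change. This is only a sketch: virt_mode is a made-up helper name, and it assumes the field is printed as in the log above.

```shell
# Print the value of the first "Virtualization Mode" field found in
# `nvidia-smi -q` output read from stdin. The section header
# "GPU Virtualization Mode" is skipped because it does not start with the field name.
virt_mode() {
    sed -n 's/^[[:space:]]*Virtualization Mode[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# On the host: nvidia-smi -q | virt_mode
```

With the log above this gives "Host VSGA". Once the host really is in Shared Direct mode, I would expect a different value here (something like "Host VGPU", to be confirmed on a working system).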

We would really appreciate some quick help in this matter!
Thanks

Hi John

It’s not a problem with the driver.

On your vSphere Host, uninstall the 430.83 (don’t upgrade), reboot the Host and re-install 440.53 so you’re running the most up to date version. That takes care of that.

Once the driver has been reinstalled, make sure vCenter is still configured to "Shared Direct". Now that side of the install is taken care of and there’s no need to revisit it.
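
From the ESXi shell, the uninstall and reinstall described above might look roughly like the sketch below. The helper names are made up for illustration, <datastore> is a placeholder, and the VIB name is taken from the list output rather than assumed. The host should be in maintenance mode with no VMs running.

```shell
# Print the name column of NVIDIA VIBs from `esxcli software vib list` output (stdin).
nvidia_vib_names() {
    awk 'tolower($1) ~ /nvidia/ { print $1 }'
}

# Remove every NVIDIA VIB currently installed; reboot the host afterwards.
remove_nvidia_vibs() {
    esxcli software vib list | nvidia_vib_names | while read -r name; do
        esxcli software vib remove -n "$name" </dev/null
    done
}

# Install a driver VIB; $1 must be the absolute path to the .vib file.
install_vib() {
    esxcli software vib install -v "$1"
}

# Typical order on the host:
#   remove_nvidia_vibs && reboot
#   install_vib /vmfs/volumes/<datastore>/NVIDIA-VMware_ESXi_6.7_Host_Driver-440.53-1OEM.670.0.0.8169922.x86_64.vib
#   reboot
```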

Have you made any changes to the Server BIOS? If no, please review these articles to make sure your BIOS is configured correctly:

NVIDIA Support Article: Incorrect BIOS settings on a server when used with a hypervisor can cause MMIO address issues that result in GRID GPUs failing to be recognized. | NVIDIA

Use Page 23: https://images.nvidia.com/content/pdf/vgpu/guides/vgpu-deployment-guide-horizon-on-vsphere-final.pdf

Please also confirm that you are running Enterprise Plus licensing on your vSphere Hosts? (vCenter is fine with Standard licensing) …

Let us know how you get on

Regards

MG

Hi MG,

It’s still not working…

  1. I uninstalled the 430.83.
  2. I rebooted.
  3. I installed the 440.53.
  4. I don’t know how to make sure it’s working in Shared Direct.
  5. The only BIOS thing I did (before installing ESX 6.7 the first time) was to check that SR-IOV was enabled (and it actually was). Apart from that, no changes to the factory BIOS.

I read the BIOS link in your post but, because of the lockdown in France (COVID-19), I can’t go to my workplace to work on the server right now, and I don’t know when I will be able to do so…

I can for sure confirm I’m using the Enterprise Plus version of ESX
" VMware vSphere with Operations Management 6 Enterprise Plus "

But still:
[root@localhost:~] nvidia-smi
Mon Mar 16 21:06:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.53       Driver Version: 440.53       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   37C    P8    17W /  70W |     92MiB / 15359MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2100312      G   Xorg                                           5MiB |
+-----------------------------------------------------------------------------+
which means still no GRID!

and also:
[root@localhost:~] nvidia-smi vgpu
Not supported devices in vGPU mode

I’m lost! Help…
Regards