Using A100 GPU in Ubuntu VM (vcs)

I have a server provided by IT with an A100 GPU. In the Ubuntu 22.04 VM I installed the corresponding GRID driver from NVIDIA, namely “NVIDIA-Linux-x86_64-460.106.00-grid”. The installation completes successfully. I then configured the license server for vCS and all seems fine. I get the output below from nvidia-smi.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100-3-20C      On  | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1820MiB / 20475MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                 |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |   1820MiB / 20475MiB | 42     N/A|  3   0    2    0    0 |
|                  |      4MiB /  4096MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0       4041      C   ...conda/envs/tf2/bin/python       12MiB |
+-----------------------------------------------------------------------------+

Under “Processes” you can see a TensorFlow 2 process, and this is where the issue starts. I can import TensorFlow and the GPU is detected, but as soon as I try any calculation it just hangs forever and nothing happens. As can be seen, a process is created on the GPU, but nothing seems to happen.
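Roughly, the kind of minimal test that hangs looks like this (a sketch of what I am running; the exact script doesn't matter, any op placed on the GPU behaves the same way):

import tensorflow as tf

# The GPU is detected without problems: this lists one physical GPU device
print(tf.config.list_physical_devices('GPU'))

# ...but execution stalls at the first op that actually runs on the GPU;
# nvidia-smi shows the python process, yet it never finishes
a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
print(tf.matmul(a, b))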

I also tried the official NVIDIA Docker container for TensorFlow, and it results in exactly the same issue.

I also get no error messages or log entries, so I have no idea how to troubleshoot this. I have reinstalled the NVIDIA driver, but that didn't help at all.
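For completeness, this is the kind of extra logging that can be switched on on the TensorFlow side (a sketch using standard TF2 calls, nothing specific to this setup), in case it helps narrow down where the hang occurs:

import os

# Show all TensorFlow C++ log messages (0 = everything, 3 = errors only)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import tensorflow as tf

# Log the device every op gets placed on
tf.debugging.set_log_device_placement(True)

# Tiny computation to see how far execution gets before it stalls
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.matmul(x, x))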

What could be the issue? How can I further troubleshoot?

System: Lubuntu 22.04 on vSphere 7.0.3

Hi, could you please try a newer driver? vGPU 12.x is already end of life, and I doubt it works properly with Ampere.

I would recommend using at least the latest minor release from the vGPU 13 branch.

Best regards
Simon

Just to confirm: do the guest and host drivers need to be matched, or can I just update the guest?

EDIT: The A100 is pretty old, so the 12.x / 460 driver does support it:

But yeah, it doesn’t hurt to try a newer version.

EDIT 2:

According to the link you provided, digging down further, vGPU 13 and 14 do not support the A100:

Feature Support Withdrawn in Release 13.0

  • The following GPUs are no longer supported:
    • NVIDIA A100 HGX 80GB
    • NVIDIA A100 PCIe 40GB
    • NVIDIA A100 HGX 40GB

  Instead, these GPUs are supported with NVIDIA AI Enterprise.

So I don’t think a newer driver will work.

And this is kind of annoying, as I don’t care about that whole AI Enterprise thing. In fact, it is distracting and really a bait and switch, as the card wasn’t purchased for what AI Enterprise seems to target. A simple driver install is all we would need.

Unfortunately, you are right. I overlooked the vSphere comment and thought you were using KVM. You will need the NVAIE trial. There is no way to use vGPU, as the .vib won’t support the A100.