Using A100 GPU in Ubuntu VM (vCS)

I have a server provided by IT with an A100 GPU. In the Ubuntu 22.04 VM I installed the corresponding GRID driver from NVIDIA, namely “NVIDIA-Linux-x86_64-460.106.00-grid”. The installation completed successfully. I then configured the license server for vCS and everything seemed fine. I get the output below from nvidia-smi.

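+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100-3-20C     On   | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1820MiB / 20475MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |   1820MiB / 20475MiB | 42    N/A |  3   0    2    0    0 |
|                  |      4MiB /  4096MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0       4041      C   …conda/envs/tf2/bin/python         12MiB |
+-----------------------------------------------------------------------------+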

Under “Processes” you can see a TensorFlow 2 process, and this is where the issue starts. I can import TensorFlow and the GPU is detected, but as soon as I try any calculation it hangs forever. As the output shows, a process is created on the GPU, but nothing seems to happen.

I also tried the official NVIDIA Docker container for TensorFlow, and it results in exactly the same issue.

I also get no error messages or log entries, so I have no idea how to troubleshoot. I have reinstalled the NVIDIA driver, but that didn’t help at all.

What could be the issue? How can I further troubleshoot?

System: Lubuntu 22.04 on vSphere 7.0.3
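
One way to make the hang described above reproducible is to run a tiny GPU computation in a separate Python process with a timeout, so a stuck job is reported instead of blocking the shell. The following is only a minimal sketch: the helper name run_probe, the probe script TF_PROBE, and the 60 second timeout are assumptions chosen for illustration, not anything from the setup above.

```python
import subprocess
import sys

# Hypothetical probe script: a small matrix multiplication placed on the first GPU.
TF_PROBE = (
    "import tensorflow as tf\n"
    "with tf.device('/GPU:0'):\n"
    "    x = tf.random.uniform((256, 256))\n"
    "    print(float(tf.reduce_sum(tf.matmul(x, x))))\n"
)


def run_probe(code, timeout_s=60.0):
    """Run `code` in a fresh interpreter and return (status, output).

    status is "ok" (exit code 0), "error" (non-zero exit code),
    or "hang" (no exit within timeout_s seconds).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hang", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout + proc.stderr
```

Calling run_probe(TF_PROBE) from inside the tf2 environment would return "hang" if the computation never finishes, and the same helper can be pointed at a CPU-only snippet for comparison. If the process is stuck inside the driver itself, the kill may not return either, in which case the probe will block as well.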

Hi, could you please try a newer driver? vGPU 12.x is already end of life and I doubt it works properly with Ampere.

I would recommend using at least the latest minor release from the vGPU 13 branch.

Best regards

Just to confirm: do the guest and host drivers need to match? Or can I just update the guest?

EDIT: The A100 is pretty old, so the 12.x (460) driver supports it:

But yeah, it doesn’t hurt to try a newer version.


According to the link you provided, digging down further, 13 and 14 do not support the A100:

Feature Support Withdrawn in Release 13.0

  • The following GPUs are no longer supported:
    • NVIDIA A100 HGX 80GB
    • NVIDIA A100 PCIe 40GB
    • NVIDIA A100 HGX 40GB

    Instead, these GPUs are supported with NVIDIA AI Enterprise.

So I don’t think a newer driver will work.

And this is kind of annoying, as I don’t care about the whole AI Enterprise thing. In fact it is distracting, and really a bait and switch, as the card wasn’t purchased for what AI Enterprise seems to target. A simple driver install is all we need.

Unfortunately you are right. I overlooked the vSphere comment and thought you were using KVM. You will need the NVAIE trial. There is no way to use vGPU, as the .vib won’t support the A100.