Using A100 GPU in Ubuntu VM (vcs)

I have a server provided by IT with an A100 GPU. In the Ubuntu 22.04 VM I installed the corresponding GRID driver from NVIDIA, namely “NVIDIA-Linux-x86_64-460.106.00-grid”. The installation completes successfully. I then configured the license server for vCS and all seems fine. I get the output below from nvidia-smi.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100-3-20C      On  | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1820MiB / 20475MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                 |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |   1820MiB / 20475MiB | 42     N/A|  3   0    2    0    0 |
|                  |      4MiB /  4096MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0       4041      C   ...conda/envs/tf2/bin/python       12MiB |
+-----------------------------------------------------------------------------+

Under “Processes” you can see a TensorFlow 2 process, and this is where the issue starts. I can import TensorFlow and the GPU is detected, but as soon as I try any calculation it just hangs forever and nothing happens. As can be seen, a process is created on the GPU, but nothing seems to happen.
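Roughly, the kind of minimal test that hangs looks like this (a sketch of what I am running; the exact script doesn't matter, any op placed on the GPU behaves the same way):

import tensorflow as tf

# The GPU is detected without problems: this lists one physical GPU device
print(tf.config.list_physical_devices('GPU'))

# ...but execution stalls at the first op that actually runs on the GPU;
# nvidia-smi shows the python process, yet it never finishes
a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
print(tf.matmul(a, b))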

I also tried the official NVIDIA Docker container for TensorFlow, and it results in exactly the same issue.

I also get no error messages or log entries, so I have no idea how to troubleshoot this. I have reinstalled the NVIDIA driver, but that didn't help at all.
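For completeness, this is the kind of extra logging that can be switched on on the TensorFlow side (a sketch using standard TF2 calls, nothing specific to this setup), in case it helps narrow down where the hang occurs:

import os

# Show all TensorFlow C++ log messages (0 = everything, 3 = errors only)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import tensorflow as tf

# Log the device every op gets placed on
tf.debugging.set_log_device_placement(True)

# Tiny computation to see how far execution gets before it stalls
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.matmul(x, x))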

What could be the issue? How can I further troubleshoot?

System: Lubuntu 22.04 on vSphere 7.0.3

Hi, could you please try a newer driver? vGPU 12.x is already end of life, and I doubt it works properly with Ampere.

I would recommend using at least the latest minor release from the vGPU 13 branch.

Best regards
Simon

Just to confirm: do the guest and host drivers need to be matched, or can I just update the guest?

EDIT: The A100 is pretty old, so the 12.x / 460 driver does support it:

But yeah, it doesn’t hurt to try a newer version.

EDIT 2:

According to the link you provided, digging down further, vGPU 13 and 14 do not support the A100:

Feature Support Withdrawn in Release 13.0

  • The following GPUs are no longer supported:
    • NVIDIA A100 HGX 80GB
    • NVIDIA A100 PCIe 40GB
    • NVIDIA A100 HGX 40GB

  Instead, these GPUs are supported with NVIDIA AI Enterprise.

So I don’t think a newer driver will work.

And this is kind of annoying, as I don’t care about that whole AI Enterprise thing. In fact, it is distracting and really a bait and switch, as the card wasn’t purchased for what AI Enterprise seems to target. A simple driver install is all we would need.

Unfortunately, you are right. I overlooked the vSphere comment and thought you were using KVM. You will need the NVAIE trial. There is no way to use vGPU, as the .vib won’t support the A100.