I have a server provided by IT with an A100 GPU. In the Ubuntu 22.04 VM I installed the matching GRID driver from NVIDIA, namely “NVIDIA-Linux-x86_64-460.106.00-grid”. The installation completed successfully. I then configured the license server for vCS and everything seems fine. nvidia-smi gives the output below.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100-3-20C     On   | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1820MiB / 20475MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |   1820MiB / 20475MiB | 42    N/A |  3   0    2    0    0 |
|                  |      4MiB /  4096MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0       4041      C   ...conda/envs/tf2/bin/python       12MiB |
+-----------------------------------------------------------------------------+
Under “Processes” you can see a TensorFlow 2 process, and this is where the issue starts. I can import TensorFlow and the GPU is detected, but as soon as I try any calculation it hangs forever. As the output shows, a process is created on the GPU, but it never makes progress.
I also tried the official NVIDIA Docker container for TensorFlow, and it shows exactly the same behavior.
There are no error messages or log entries either, so I have no idea how to troubleshoot this. I have reinstalled the NVIDIA driver, but that didn’t help.
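A minimal sketch of how a hang like this could be reproduced under a time limit, so that it ends with a status and a Python stack dump instead of blocking the shell. The TensorFlow snippet is an assumption for illustration (a plain matrix multiplication), not the exact calculation from this post, and `run_with_timeout` and `TF_SNIPPET` are hypothetical names.

```python
import subprocess
import sys

# Hypothetical workload: an illustrative TensorFlow 2 matmul, not the exact
# calculation from the post. faulthandler dumps the Python stack to stderr
# and exits the child if it is still running after 60 seconds.
TF_SNIPPET = """
import faulthandler
faulthandler.dump_traceback_later(60, exit=True)
import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))
x = tf.random.uniform((1024, 1024))
print(float(tf.reduce_sum(tf.matmul(x, x))))
"""


def run_with_timeout(code, timeout_s):
    """Run `code` in a fresh Python process.

    Returns (status, output), where status is 'ok', 'failed' or 'hung'.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child did not finish in time; subprocess.run has killed it.
        return "hung", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout + proc.stderr
```

Calling `run_with_timeout(TF_SNIPPET, 120)` from the tf2 environment would be expected to return `failed` together with the stack dump if the calculation stalls for 60 seconds, or `hung` if the child does not even react to the watchdog. The last frame in the dump shows which Python call is blocking.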
What could be the issue? How can I further troubleshoot?
System: Lubuntu 22.04 on vSphere 7.0.3