A100 GPUs visible in nvidia-smi but not visible to PyTorch or cuda-samples

Hi,

We have a couple of Supermicro AS-4124GO-NART servers with 8x A100 HGX each.
The servers run CentOS 7; it's an up-to-date install that I use on many other servers (Dell, HPC, etc.).
> [bash]# uname -a
> Linux chasma-01.int.europe.naverlabs.com 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

CUDA 11.2 was installed from the CUDA rhel7 repo.
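(Roughly the standard repo-based install; a sketch, assuming the public rhel7 x86_64 repo URL:)
[bash]# yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
[bash]# yum clean expire-cache
[bash]# yum install -y cuda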
> [bash]# nvidia-smi
> Tue Mar  2 12:13:53 2021
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  A100-SXM4-40GB      Off  | 00000000:07:00.0 Off |                    0 |
> | N/A   26C    P0    43W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> ...
> +-------------------------------+----------------------+----------------------+
> |   7  A100-SXM4-40GB      Off  | 00000000:CA:00.0 Off |                    0 |
> | N/A   26C    P0    41W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+

The GPUs are visible with lspci and in /proc/driver:
> [bash]# lspci | grep A100
> 07:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> 0a:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> 47:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> 4d:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> 87:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> 8d:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> c7:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> ca:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
> [bash]# cat /proc/driver/nvidia/gpus/0000:07:00.0/information
> Model: A100-SXM4-40GB
> IRQ: 75
> GPU UUID: GPU-52e14df6-efe7-359c-97d4-4c0fb0831df6
> Video BIOS: 92.00.36.00.04
> Bus Type: PCIe
> DMA Size: 47 bits
> DMA Mask: 0x7fffffffffff
> Bus Location: 0000:07:00.0
> Device Minor: 2
> Blacklisted: No

But they are not visible in PyTorch, TensorFlow, the compiled cuda-samples, or in the container nvcr.io/nvidia/pytorch:21.02-py3.
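(For example, a quick check, assuming PyTorch is installed on the host and the NGC container is launched through the NVIDIA container toolkit:)
[bash]# python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
[bash]# docker run --rm --gpus all nvcr.io/nvidia/pytorch:21.02-py3 python -c "import torch; print(torch.cuda.device_count())"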

Please enable nvidia-persistenced to start on boot and make sure it is continuously running. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. Please also post the output of the deviceQuery demo.
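(On CentOS 7 this is handled through systemd; a sketch, assuming the driver package installed the nvidia-persistenced unit:)
[bash]# systemctl enable nvidia-persistenced
[bash]# systemctl start nvidia-persistenced
[bash]# systemctl status nvidia-persistenced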

Hi generix,

Please find the bug report.
nvidia-bug-report.log.gz (5.1 MB)

I think you are talking about deviceQuery from cuda-samples.
[bash]# ./cuda-samples/bin/x86_64/linux/release/deviceQuery
./cuda-samples/bin/x86_64/linux/release/deviceQuery Starting…

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL

Please start the fabric-manager
https://www.supermicro.com/support/faqs/faq.cfm?faq=31029
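(On CentOS 7 with the CUDA repo that is roughly the following; the package name may differ by driver branch, e.g. nvidia-fabric-manager vs. nvidia-fabricmanager-460, so treat it as a sketch:)
[bash]# yum install -y nvidia-fabric-manager
[bash]# systemctl enable nvidia-fabricmanager
[bash]# systemctl start nvidia-fabricmanager
[bash]# systemctl status nvidia-fabricmanager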


Thanks,
that fixed it.