Not all Tesla T4 recognized by NVIDIA driver

Hi all,

Our partner is setting up NVIDIA Tesla T4 cards on an HPE DL380 Gen10 and has run into a problem. Could you give us some pointers, please?

The DL380 has 7 Tesla T4 cards installed. The NVIDIA driver recognizes 3 of them, but not the other 4.

$ nvidia-smi
+----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+---------------------+---------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:12:00.0 Off | 0 |
| N/A 65C P0 29W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+------------------------------+---------------------+---------------------+
| 1 Tesla T4 Off | 00000000:13:00.0 Off | 0 |
| N/A 73C P0 33W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+------------------------------+---------------------+---------------------+
| 2 Tesla T4 Off | 00000000:37:00.0 Off | 0 |
| N/A 70C P0 31W / 70W | 0MiB / 15109MiB | 4% Default |
| | | N/A |
+------------------------------+---------------------+---------------------+

+----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+----------------------------------------------------------------------------+

All 7 cards are visible on the OS side, though.

$ lspci | grep -i nvidia
12:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
13:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
86:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
b0:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
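To see at a glance which cards the driver is missing, the two listings can be compared by PCI address. The following is only a rough sketch: the function name `missing_gpus` is made up for illustration, and the sample data is copied from the outputs in this post. It assumes `nvidia-smi` reports bus IDs with a leading PCI domain (as in `00000000:12:00.0`) while plain `lspci` does not.

```shell
#!/bin/sh
# Sample data taken from this post. On a live system you could instead use:
#   lspci_out=$(lspci | grep -i nvidia)
#   smi_out=$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader)
lspci_out='12:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
13:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
86:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
b0:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)'
smi_out='00000000:12:00.0
00000000:13:00.0
00000000:37:00.0'

# missing_gpus (hypothetical helper): print the PCI addresses that appear
# in the lspci listing ($1) but not in the nvidia-smi bus ID list ($2).
missing_gpus() {
    # Drop the PCI domain prefix and lowercase the hex digits so the
    # nvidia-smi IDs are directly comparable with lspci addresses.
    smi_short=$(printf '%s\n' "$2" | sed 's/^[0-9A-Fa-f]*://' | tr 'A-F' 'a-f')
    printf '%s\n' "$1" | while read -r addr _rest; do
        [ -n "$addr" ] || continue
        # Exact, fixed-string match of the whole line.
        printf '%s\n' "$smi_short" | grep -qxF -- "$addr" || echo "$addr"
    done
}

missing_gpus "$lspci_out" "$smi_out"
```

With the data from this post, the cards at 86:00.0, af:00.0, b0:00.0 and d8:00.0 are the ones the driver does not list.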

Do we have to install anything else? We used the NVIDIA driver downloaded from the page below.

NVIDIA Driver Downloads

We installed the following driver.

DATA CENTER DRIVER FOR LINUX X64
Version: 460.32.03
Release Date: 2021.1.19
Operating System: Linux 64-bit
CUDA Toolkit: 11.2
Language: Japanese
File Size: 169.84 MB

OS: Ubuntu 20.04.2 LTS

Thank you for your support.

Please run nvidia-bug-report.sh as root
and attach the resulting nvidia-bug-report.log.gz file to your post.
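While the bug report is being collected, a quick look at the kernel log for the affected cards can also be useful. The helper below is a minimal sketch, not an official tool: `gpu_log_lines` is a made-up name, and it assumes the NVIDIA kernel module tags its messages with `NVRM` and that kernel messages about a device include its PCI address.

```shell
#!/bin/sh
# gpu_log_lines (hypothetical helper): read a kernel log on stdin and print
# the lines that mention NVRM or any of the PCI addresses given as arguments.
gpu_log_lines() {
    pattern='NVRM'
    for addr in "$@"; do
        # Escape the dot so it matches literally in the extended regex.
        addr_esc=$(printf '%s' "$addr" | sed 's/\./\\./g')
        pattern="$pattern|$addr_esc"
    done
    grep -iE -- "$pattern"
}

# Example on a live system (reading dmesg usually needs root):
#   sudo dmesg | gpu_log_lines 86:00.0 af:00.0 b0:00.0 d8:00.0
```

The addresses in the usage comment are the four cards that show up in lspci but not in nvidia-smi in the original post.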

Thanks for your reply. The problem has been fixed.
They tried several things, so we don't know exactly which one fixed it. Things they tried:

  1. Updating BIOS
  2. Enabling the DKMS option when installing the NVIDIA driver
  3. Changing OS version
  4. Resetting BIOS settings

Thank you.

Hi, I'm having a very similar issue with a PH402 dual P100 card: it shows up in lspci but is not recognized by nvidia-smi or CUDA's deviceDetect.

Can you specify which distribution/version you ended up with?

Thanks

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"