T4 GPU not detected on Ubuntu 18.04 and 20.04

Hello guys,

I have a problem with T4 GPUs. I installed them in different servers and ran into several problems:
On some servers, they weren’t detected at all, whether or not the driver and CUDA were installed.
Sometimes they were detected, but after a reboot they weren’t anymore; they would reappear only on some reboots.
Servers would crash for no apparent reason (no temperature problem or anything) after around 20 minutes.

Here is what I get the few times it works:
uname -a on ubuntu 20.04:
Linux *** 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

uname -a on ubuntu 18.04:
Linux *** 4.15.0-142-generic #146-Ubuntu SMP Tue Apr 13 01:11:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

lspci :
01:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

nvidia-smi :
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:01:00.0 Off |                    0 |
| N/A   82C    P0    69W /  70W |  13620MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |

Has anyone had similar problems with a T4 GPU on Ubuntu?
Do you know what I could do to solve it?

With ubuntu 18.04:
nvidia-bug-report.log.gz (56.4 KB)
(GPU not detected)

With ubuntu 20.04:
ubuntu2004-nvidia-bug-report.log.gz (401.7 KB)
(GPU has been detected)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Files attached for Ubuntu 18.04 and Ubuntu 20.04.
The Ubuntu 20.04 log is from a lucky boot on which the GPU was detected.

On the 20.04 boot, the T4 is at >70°C while idle. The T4 doesn’t have a fan of its own; it relies on the server chassis to provide airflow. It seems it’s already overheating and shutting down while the BIOS tries to detect it. Please provide proper airflow to the T4.
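
To see this for yourself, you can read the temperature straight from nvidia-smi shortly after boot. This is a rough sketch, assuming the driver is loaded and nvidia-smi is on the PATH. The classify_temp helper and the 70°C cutoff are only illustrative (taken from the idle reading above), not an official NVIDIA limit:

```shell
#!/bin/sh
# Hypothetical helper: label a temperature reading (degrees C) against a
# cutoff. The default of 70 is an assumption for illustration only.
classify_temp() {
    if [ "$1" -ge "${2:-70}" ]; then
        echo "hot"
    else
        echo "ok"
    fi
}

# Query each GPU's temperature, but only if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    for t in $(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits); do
        # Skip anything that is not a plain number (e.g. an error message).
        case "$t" in
            ''|*[!0-9]*) continue ;;
        esac
        echo "GPU temperature: ${t}C ($(classify_temp "$t"))"
    done
fi
```

Running it repeatedly (for example under watch) while the server sits idle should show whether the chassis airflow keeps up.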

I increased the fan speed, and the temperature now seems stable at 36°C at idle.

Here is the log:
ubuntu2004nvidia-bug-report-2.log.gz (415.7 KB)

Is it still crashing?

42 minutes so far and still no crash!

I increased the fan speed on the server with Ubuntu 18.04 and the GPU was detected.

I didn’t know the GPU would get that hot just at boot. I usually only increased the fan speed when I was running a burn test on it.

Attached is the log for Ubuntu 18.04:
ubuntu1804-nvidia-bug-report.log.gz (1.3 MB)

For proper operation, you’ll have to enable and run nvidia-persistenced. Furthermore, prevent the X server from starting on or using the T4.

I ran nvidia-persistenced as root on both servers, but I got the following error:
nvidia-persistenced failed to initialize. Check syslog for more details.

What do you mean by disabling the X server?
I usually install just the driver and CUDA. Should I run nvidia-persistenced after those installs?

On the 20.04 install, the desktop was enabled. Just disable it: sudo systemctl disable display-manager
Did you check the journal to see why nvidia-persistenced failed? Depending on the install method, a systemd unit should have been installed for it.
https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon
https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html
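
The steps above can be sketched as a small script. It assumes the driver package installed a systemd unit called nvidia-persistenced (the name can differ by install method) and that display-manager is the unit that starts the desktop. The run wrapper and the DRY_RUN variable are hypothetical helpers: by default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Assumption: unit name as installed by the driver package; adjust if needed.
UNIT="${UNIT:-nvidia-persistenced}"

# Hypothetical wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the desktop from starting on the next boot.
run sudo systemctl disable display-manager
# Enable the persistence daemon at boot and start it now.
run sudo systemctl enable --now "$UNIT"
# If it fails, the reason should show up in the unit status and journal.
run systemctl status "$UNIT" --no-pager
run journalctl -u "$UNIT" -b --no-pager
# Check what persistence mode the driver reports for each GPU.
run nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
```

Setting DRY_RUN=0 in the environment executes the commands instead of printing them.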