Ubuntu 18.04/Drivers 430.50 : 2080ti cannot be used in Linux (ERROR reported in nvidia-smi)

Hello,

With the latest version of the NVIDIA drivers on Ubuntu 18.04 and an up-to-date system, GPU 0 in my workstation (which has four 2080ti cards) has issues and cannot be used. The 3 other cards have no problems.

In nvidia-smi, it is marked with an error (nvidia-smi takes a few seconds to print its result instead of instantly as usual) and it cannot be used for any tasks (such as deep learning). The issue happens whether or not the monitor is connected to this GPU. If the monitor is connected to this GPU, nothing is displayed and the screen remains black after boot.

I am not sure if it’s a hardware problem or a driver problem. I reinstalled the system and tried 2 kernel versions (HWE kernel 5.0.0.23 and 5.0.0.31), and the issue happens in all cases. As all our other systems with the same configuration work perfectly, I guess it’s a hardware issue, but I would like to know more about this and find out how I could diagnose this kind of issue by myself in the future.

I’ll attach the nvidia-bug-report result to this post, hoping it will be useful. Thank you for your time.

Best regards,
– Gauthier
nvidia-bug-report.log.gz (99 KB)

You’re running into an Xid 62 error:

[   12.282845] NVRM: GPU at PCI:0000:19:00: GPU-5e27f482-9dfa-a312-b8f8-abee669efe17
[   12.282850] NVRM: GPU Board Serial Number: 
[   12.282853] NVRM: Xid (PCI:0000:19:00): 62, 0cb5(2d50) 00000000 00000000

Unfortunately, Xid 62 is very unspecific: it can be caused by either hardware or software. Since the other cards of the same model work in the other slots, this is probably a hardware fault. It would be best to test the failing card in a different system to confirm.
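
If you want to spot this kind of error yourself next time, here is a minimal Python sketch that scans kernel log text for NVRM Xid lines and pairs each one with the GPU UUID reported for the same PCI address. The name `find_xid_events` and the regular expressions are my own, and they assume the line format shown in the excerpt above; other driver versions may print Xid lines differently, so treat it as a starting point rather than a finished tool.

```python
import re

# Assumed formats, taken from the log excerpt quoted in this thread:
#   NVRM: GPU at PCI:0000:19:00: GPU-5e27f482-...
#   NVRM: Xid (PCI:0000:19:00): 62, 0cb5(2d50) 00000000 00000000
GPU_RE = re.compile(r"NVRM: GPU at (PCI:[0-9a-fA-F:.]+): (GPU-[0-9a-fA-F-]+)")
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),?\s*(.*)")

# Sample input: the three lines from the bug report discussed above.
SAMPLE_LOG = """\
[   12.282845] NVRM: GPU at PCI:0000:19:00: GPU-5e27f482-9dfa-a312-b8f8-abee669efe17
[   12.282850] NVRM: GPU Board Serial Number: 
[   12.282853] NVRM: Xid (PCI:0000:19:00): 62, 0cb5(2d50) 00000000 00000000
"""


def find_xid_events(log_text):
    """Return one dict per Xid line found in log_text (hypothetical helper)."""
    uuids = {}   # PCI address -> GPU UUID, filled from "GPU at" lines
    events = []
    for line in log_text.splitlines():
        gpu = GPU_RE.search(line)
        if gpu:
            uuids[gpu.group(1)] = gpu.group(2)
            continue
        xid = XID_RE.search(line)
        if xid:
            pci = xid.group(1)
            events.append({
                "pci": pci,
                "xid": int(xid.group(2)),
                "details": xid.group(3).strip(),
                "uuid": uuids.get(pci),  # None if no "GPU at" line was seen
            })
    return events


for event in find_xid_events(SAMPLE_LOG):
    print(event)
```

To use it on a real machine, replace `SAMPLE_LOG` with the text of your kernel log (for example the output of `dmesg`, which may need root). The UUID is useful when you move the card: if the same UUID keeps producing Xid errors in a different slot or a different computer, the fault follows the card rather than the slot.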

I did more tests (moving the card to another slot and testing it in another computer) and it was indeed a hardware issue.

We sent the card to AS for replacement. Thank you for your answer.