GeForce 2080 RTX ti on Ubuntu 18.04 stops working after a while

miguel.melchor · April 9, 2019, 1:42pm

Hello,

I have been redirected here by NVIDIA Customer Care (incident 190227-000272).

We have a GeForce RTX 2080 Ti (s/n 0324118055234), bought from NVIDIA, in a machine running Ubuntu 18.04.
We intend to use the board for deep learning (NVIDIA driver 415, CUDA 10).
At times the board is reported properly by the nvidia-smi command. We launch our computations and
the board starts working, but after a while (about 30 min) the card stops (crashes?) and is no longer detected by nvidia-smi.

In an attempt to troubleshoot the problem, we installed the board in another Linux machine (also Ubuntu 18.04), but the behavior was exactly the same (available at first, failed after a while).
This may also occur after a reboot of the machine, but I cannot say for sure.
The power supplies of the two machines were 800 W and 1200 W. No other boards were connected at the same time.

We wonder whether the card is defective.

Please find below the output of some commands:

$ lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GV102 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
02:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
02:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)

$ nvidia-smi
No devices were found # After the board has stopped working - otherwise detected OK

I also tried to attach a debug report (nvidia-bug-report.log.gz), but I am not sure whether the upload succeeded.

Best regards,

nvidia-bug-report.log.gz (569 KB)

Mounir · April 9, 2019, 1:58pm

Hi,
You have a Turing card, but Ubuntu detects a Volta card. You should see this:
VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)

Something is wrong with the driver version you installed.
Regards

miguel.melchor · April 9, 2019, 2:22pm

Thank you for your quick reply Mounir,

As soon as I have a maintenance slot available, I will reinstall the board, check/update the driver installation, reboot and get back to you with the results.

Regards,

generix · April 9, 2019, 2:45pm

The wrong chip type reported by lspci is harmless; it just means an older PCI ID database is installed.
Ultimately, you were running into this:

[1114364.893082] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1114364.893119] NVRM: rm_init_adapter failed for device bearing minor number 0

First of all, please set nvidia-persistenced to start on boot; it is needed when running headless or multi-GPU.
If you still run into this with nvidia-persistenced running, it is probably a hardware fault: check the card using gpu-burn, check it in another system, then RMA.
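
For anyone who wants to check their own logs for the same failure, here is a minimal Python sketch (not part of the original reply) that pulls the NVRM adapter-init lines quoted above out of saved kernel log text. The function name `find_nvrm_init_failures` and the `SAMPLE` string are illustrative assumptions, and the pattern only covers the timestamped dmesg format shown in this post.

```python
import re

# Matches timestamped kernel log lines such as:
# [1114364.893082] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
NVRM_LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM:\s+(?P<msg>.*)$")

# Substrings taken from the two log lines quoted in this thread.
INIT_FAILURE_MARKERS = ("RmInitAdapter failed", "rm_init_adapter failed")


def find_nvrm_init_failures(log_text):
    """Return (seconds_since_boot, message) for each NVRM init-failure line."""
    failures = []
    for line in log_text.splitlines():
        match = NVRM_LINE.match(line.strip())
        if match is None:
            continue
        message = match.group("msg")
        if any(marker in message for marker in INIT_FAILURE_MARKERS):
            failures.append((float(match.group("ts")), message))
    return failures


# Sample input: the two lines from the bug report quoted above.
SAMPLE = """\
[1114364.893082] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1114364.893119] NVRM: rm_init_adapter failed for device bearing minor number 0
"""

if __name__ == "__main__":
    for seconds, message in find_nvrm_init_failures(SAMPLE):
        print(f"{seconds:.6f} s after boot: {message}")
```

To use it on a real machine, the kernel log could be saved to a file first (for example with `sudo dmesg > dmesg.txt`) and the file contents passed to the function.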

miguel.melchor · April 16, 2019, 11:22am

Hello generix,

Thank you for your answer. I was finally able to reinstall the board in the machine and run gpu_burn.

nvidia-persistenced is not running, but I guess this is OK since this machine is not headless (we have a monitor, keyboard and mouse) and there is only one GPU in this computer.

Please find below the outcome of nvidia-smi and gpu_burn:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ ./gpu_burn 120
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-cce1b9f3-d111-81d4-f5d5-1843cb089d29)
Initialized device 0 with 10989 MB of memory (10739 MB available, using 9665 MB of it), using FLOATS
10.8% proc’d: 6020 (12427 Gflop/s) errors: 0 temps: 53 C
Summary at: mar abr 16 12:15:27 CEST 2019

21.7% proc’d: 15652 (12325 Gflop/s) errors: 0 temps: 60 C
Summary at: mar abr 16 12:15:40 CEST 2019

32.5% proc’d: 24682 (12425 Gflop/s) errors: 465258770 (WARNING!) temps: 63 C C
Summary at: mar abr 16 12:15:53 CEST 2019

43.3% proc’d: 34314 (12429 Gflop/s) errors: 1650092454 (WARNING!) temps: 68 C
Summary at: mar abr 16 12:16:06 CEST 2019

53.3% proc’d: 42742 (12429 Gflop/s) errors: -2053590692 (WARNING!) temps: 70 C
Summary at: mar abr 16 12:16:18 CEST 2019

64.2% proc’d: 52374 (12432 Gflop/s) errors: 0 temps: 73 C
Summary at: mar abr 16 12:16:31 CEST 2019

75.0% proc’d: 62006 (12428 Gflop/s) errors: 0 temps: 75 C
Summary at: mar abr 16 12:16:44 CEST 2019

85.8% proc’d: 71036 (12431 Gflop/s) errors: 1465239 (WARNING!) temps: 78 C
Summary at: mar abr 16 12:16:57 CEST 2019

96.7% proc’d: 80668 (12425 Gflop/s) errors: 3748019 (WARNING!) temps: 80 C
Summary at: mar abr 16 12:17:10 CEST 2019

100.0% proc’d: 84280 (12425 Gflop/s) errors: 1344061 (WARNING!) temps: 81 C
Killing processes… done

Tested 1 GPUs:
GPU 0: FAULTY

A further call to nvidia-smi fails to report the board:

$ nvidia-smi
No devices were found

So, can we conclude that the board is defective and we should ask for an RMA?

Best regards,

generix · April 16, 2019, 11:35am

Neither the nvidia-bug-report.log nor the nvidia-smi output shows a running X server, so nvidia-persistenced would be necessary.
It doesn’t matter, though: with the results from gpu-burn you can safely assume that the GPU is faulty and RMA it.
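
As a rough illustration (not something posted in the thread), the gpu_burn progress lines above can be parsed with a few lines of Python to find where the errors begin. This is a sketch that assumes the single-GPU line format pasted in this thread; the names `parse_gpu_burn` and `first_error_sample` are made up for the example. Any nonzero error count is treated as a failure, including the negative value, which cannot be a real count and most likely comes from the counter overflowing (an assumption about gpu_burn internals, not something confirmed here).

```python
import re

# One gpu_burn progress line as pasted in this thread, for example:
# 32.5% proc'd: 24682 (12425 Gflop/s) errors: 465258770 (WARNING!) temps: 63 C
# "proc.d" accepts both the straight and the typographic apostrophe.
PROGRESS = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)%\s+proc.d:\s+(?P<iters>\d+)\s+"
    r"\((?P<gflops>\d+)\s+Gflop/s\)\s+errors:\s+(?P<errors>-?\d+)"
    r".*?temps:\s+(?P<temp>\d+)\s+C"
)


def parse_gpu_burn(text):
    """Return one dict per progress line found in gpu_burn output."""
    samples = []
    for match in PROGRESS.finditer(text):
        samples.append({
            "percent": float(match.group("pct")),
            "gflops": int(match.group("gflops")),
            "errors": int(match.group("errors")),
            "temp_c": int(match.group("temp")),
        })
    return samples


def first_error_sample(samples):
    """Return the first sample with a nonzero error count, or None."""
    for sample in samples:
        if sample["errors"] != 0:
            return sample
    return None


# Three lines copied from the run posted above.
SAMPLE = """\
10.8% proc'd: 6020 (12427 Gflop/s) errors: 0 temps: 53 C
32.5% proc'd: 24682 (12425 Gflop/s) errors: 465258770 (WARNING!) temps: 63 C C
53.3% proc'd: 42742 (12429 Gflop/s) errors: -2053590692 (WARNING!) temps: 70 C
"""

if __name__ == "__main__":
    samples = parse_gpu_burn(SAMPLE)
    first = first_error_sample(samples)
    if first is None:
        print("no errors reported")
    else:
        print(f"first errors at {first['percent']}% and {first['temp_c']} C")
```

On the run above, the first nonzero count shows up at 32.5% and 63 C, while throughput stays around 12.4 Tflop/s throughout.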

miguel.melchor · April 16, 2019, 11:41am

Thank you very much for your help, generix.
