GeForce RTX 2080 Ti on Ubuntu 18.04 stops working after a while

Hello,

I have been redirected here by NVIDIA Customer Care (incident 190227-000272).

We have a GeForce RTX 2080 Ti (s/n 0324118055234) bought from NVIDIA, installed in a machine running Ubuntu 18.04.
We intend to use the board for deep learning computations (NVIDIA driver 415, CUDA 10).
Initially the board is properly reported by the nvidia-smi command. We launch our computations and
the board starts working, but after a while (say 30 minutes) the card stops (crashes?) and is no longer detected by nvidia-smi.

In an attempt to troubleshoot the problem we installed the board in another Linux machine (also Ubuntu 18.04), but the behavior was exactly the same (available at first, failed after a while).
This may also occur right after a reboot of the machine (but I cannot tell this for sure).
The power supplies of the two machines are 800 W and 1200 W respectively; no other boards were connected at the same time.

We wonder whether the card is defective.

Please find below the output of some commands:

$ lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GV102 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
02:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
02:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)

$ nvidia-smi
No devices were found # After the board has stopped working - otherwise detected OK

I also tried to attach a debug report (nvidia-bug-report.log.gz), but I am not sure whether the attachment went through.

Best regards,

nvidia-bug-report.log.gz (569 KB)

Hi,
You have a Turing card but Ubuntu detects a Volta card… you should see this instead:
VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)

Something is wrong with the driver version you installed.
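
If it comes to reinstalling, something along these lines usually works on Ubuntu 18.04 (package names assume the graphics-drivers PPA; pick whatever ubuntu-drivers actually recommends on your system):

$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt update
$ ubuntu-drivers devices                        # lists the driver packages recommended for your GPU
$ sudo apt install --reinstall nvidia-driver-415
$ sudo reboot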
Regards

Thank you for your quick reply, Mounir,

As soon as I have a maintenance slot available I will reinstall the board, check/update the driver installation, reboot, and come back to you with the results.

Regards,

The wrong chip type reported by lspci is harmless; it just means an older PCI ID database (pci.ids) is installed.
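If you want lspci to show the proper TU102 name, the local PCI ID database can usually be refreshed with update-pciids (part of pciutils); this is purely cosmetic though:

$ sudo update-pciids          # downloads the latest pci.ids
$ lspci | grep -i nvidia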
Ultimately, you were running into this:

[1114364.893082] NVRM: RmInitAdapter failed! (0x24:0x65:1088)
[1114364.893119] NVRM: rm_init_adapter failed for device bearing minor number 0
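
For reference, this message ends up in the kernel log, so if the card drops out again you can check for it with something like:

$ dmesg | grep -i nvrm
$ journalctl -k | grep -i nvrm    # equivalent on systemd systems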

First of all, please set nvidia-persistenced to start on boot; it is needed when running headless or with multiple GPUs.
If you are still running into this with the persistence daemon running, it is probably a hardware fault: check the card using gpu-burn, check it in another system, then RMA it.
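
To have it start on boot (assuming the driver package installed the usual systemd unit for it), something like this should do:

$ sudo systemctl enable nvidia-persistenced
$ sudo systemctl start nvidia-persistenced
$ systemctl status nvidia-persistenced      # should report "active (running)"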

Hello generix,

Thank you for your answer. I could finally reinstall the board in the machine and run gpu_burn.

nvidia-persistenced is not running, but I guess this is OK since this machine is not headless (it has a monitor, keyboard and mouse) and there is only one GPU in this computer.
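
For reference, gpu_burn was built from source roughly along these lines (assuming the usual wilicc/gpu-burn GitHub repository and the CUDA 10 toolkit on the default path):

$ git clone https://github.com/wilicc/gpu-burn.git
$ cd gpu-burn
$ make
$ ./gpu_burn 120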

Please find below the outcome of nvidia-smi and gpu_burn:

$ nvidia-smi
Tue Apr 16 12:08:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 22%   44C    P0     1W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ ./gpu_burn 120
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-cce1b9f3-d111-81d4-f5d5-1843cb089d29)
Initialized device 0 with 10989 MB of memory (10739 MB available, using 9665 MB of it), using FLOATS
10.8% proc’d: 6020 (12427 Gflop/s) errors: 0 temps: 53 C
Summary at: mar abr 16 12:15:27 CEST 2019

21.7% proc’d: 15652 (12325 Gflop/s) errors: 0 temps: 60 C
Summary at: mar abr 16 12:15:40 CEST 2019

32.5% proc’d: 24682 (12425 Gflop/s) errors: 465258770 (WARNING!) temps: 63 C
Summary at: mar abr 16 12:15:53 CEST 2019

43.3% proc’d: 34314 (12429 Gflop/s) errors: 1650092454 (WARNING!) temps: 68 C
Summary at: mar abr 16 12:16:06 CEST 2019

53.3% proc’d: 42742 (12429 Gflop/s) errors: -2053590692 (WARNING!) temps: 70 C
Summary at: mar abr 16 12:16:18 CEST 2019

64.2% proc’d: 52374 (12432 Gflop/s) errors: 0 temps: 73 C
Summary at: mar abr 16 12:16:31 CEST 2019

75.0% proc’d: 62006 (12428 Gflop/s) errors: 0 temps: 75 C
Summary at: mar abr 16 12:16:44 CEST 2019

85.8% proc’d: 71036 (12431 Gflop/s) errors: 1465239 (WARNING!) temps: 78 C
Summary at: mar abr 16 12:16:57 CEST 2019

96.7% proc’d: 80668 (12425 Gflop/s) errors: 3748019 (WARNING!) temps: 80 C
Summary at: mar abr 16 12:17:10 CEST 2019

100.0% proc’d: 84280 (12425 Gflop/s) errors: 1344061 (WARNING!) temps: 81 C
Killing processes… done

Tested 1 GPUs:
GPU 0: FAULTY

A further call to nvidia-smi fails to report the board:

$ nvidia-smi
No devices were found

So, can we conclude that the board is defective and that we should ask for an RMA?

Best regards,

Neither the nvidia-bug-report.log nor the nvidia-smi output reports a running X server, so nvidia-persistenced would be necessary.
It doesn't matter, though, since with these gpu-burn results you can safely assume that the GPU is faulty and RMA it.

Thank you very much for your help, generix.