X server crashes - GeForce GTX 660 - Driver 418.56 - archlinux

Hi,

(Sorry for my broken English.)
I recently (2 months ago) installed Arch Linux on my computer (it was
previously running Debian unstable). Since then, I have had 4 crashes that seem
related to the NVIDIA driver.
They didn’t happen under the same circumstances, but the symptoms were
the same: strange blocks of pixels on the screen, X hangs, and, for 3 of the 4 crashes, the whole system
hangs.

Can you please help me solve this problem, or at least find a
workaround?

Regards,
JF
nvidia-bug-report.log.gz (1.02 MB)
nvidia-bug-report.log.old.gz (1.02 MB)

Which DE are you using?
Do you get a stable system if you revert to the 390 or 340 legacy drivers?

I’m using Cinnamon.

I haven’t tried reverting to an older version, because the crash is not easy to reproduce. There were 3 weeks between the last 2 crashes.

It’s hard to say anything definitive: the logs show two crashes that are similar but not identical.

Here are some more journalctl logs that don’t seem to be in the report.
journactl_20190205.gz (40.8 KB)

One more crash, just now.
Xorg.0.log.old.gz (5.58 KB)
nvidia-bug-report.log.gz (1.02 MB)

Another crash, just now.
Hope this helps…
crash_0222_Xorg.log (33.1 KB)
nvidia-bug-report.log.gz (1.03 MB)

The errors you ran into according to the logs:
XID 8
XID 31+8
XID 13+8
XID 31+56
It’s really hard to say, since the combination is a bit different each time. Maybe check for faulty video memory using cuda-memtest and gpu-burn.
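
If it helps to keep track of which Xid codes show up across crashes, here is a minimal Python sketch that counts them in saved kernel log text. It assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line format; the function names are hypothetical, and the short descriptions are paraphrased from NVIDIA's Xid documentation rather than taken from the logs in this thread.

```python
import re
from collections import Counter

# Short descriptions paraphrased from NVIDIA's public Xid documentation.
# Only the codes mentioned in this thread are listed.
XID_MEANINGS = {
    8: "GPU stopped processing",
    13: "Graphics engine exception",
    31: "GPU memory page fault",
    56: "Display engine error",
    62: "Internal micro-controller halt",
}

# Assumed line format: "NVRM: Xid (PCI:0000:01:00): 31, Ch 00000010, ..."
XID_LINE = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def count_xids(log_text):
    """Return a Counter mapping Xid code -> number of occurrences."""
    return Counter(int(m.group(1)) for m in XID_LINE.finditer(log_text))


def summarize(log_text):
    """Return one human-readable line per Xid code, sorted by code."""
    return [
        f"Xid {code} x{n}: {XID_MEANINGS.get(code, 'see NVIDIA Xid docs')}"
        for code, n in sorted(count_xids(log_text).items())
    ]
```

For example, save the kernel messages of a crashed boot to a file (such as the output of `journalctl -k -b -1`) and pass its contents to `summarize()`.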

Unfortunately, cuda-memtest doesn’t seem to work:

$ ocl_memtest
hostname is aragorn
CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.2 CUDA 10.1.113
Device 0 is CL_DEVICE_TYPE_GPU, “GeForce GTX 660”
allocated 1725 Mbytes from device 0
[02/23/2019 20:16:11][aragorn][0]:Test0 [Walking 1 bit]
[02/23/2019 20:16:11][aragorn][0]:Test0: global walk test
ERROR: opencl call failed with rc(-5), line 39, file ocl_tests.cpp
Error: Out of resources

Do I need anything else to be able to run the test?

$ pacman -Qs nvidia
local/cuda_memtest 1.2.3-3
A GPU memory test utility for NVIDIA and AMD GPUs. OpenCL version.
local/lib32-libvdpau 1.1.1-3
Nvidia VDPAU library
local/lib32-nvidia-utils 418.43-1
NVIDIA drivers utilities (32-bit)
local/libvdpau 1.1.1+3+ga21bf7a-1
Nvidia VDPAU library
local/libxnvctrl 418.43-1
NVIDIA NV-CONTROL X extension
local/nvidia-dkms 418.43-2
NVIDIA driver sources for linux
local/nvidia-settings 418.43-1
Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 418.43-1
NVIDIA drivers utilities
local/opencl-nvidia 418.43-1
OpenCL implemention for NVIDIA

Regards,
JF

Just started some tests with gpu-burn:

$ ./gpu_burn -d 300
GPU 0: GeForce GTX 660 (UUID: GPU-362d83a9-dfec-ae62-fe7d-da8df851203f)
Initialized device 0 with 1994 MB of memory (1703 MB available, using 1533 MB of it), using DOUBLES
11.0% proc’d: 135 (71 Gflop/s) errors: 0 temps: 46 C
Summary at: sam. févr. 23 20:40:58 CET 2019

21.7% proc’d: 225 (71 Gflop/s) errors: 0 temps: 50 C
Summary at: sam. févr. 23 20:41:30 CET 2019

32.7% proc’d: 405 (71 Gflop/s) errors: 0 temps: 53 C
Summary at: sam. févr. 23 20:42:03 CET 2019

43.3% proc’d: 495 (71 Gflop/s) errors: 0 temps: 56 C
Summary at: sam. févr. 23 20:42:35 CET 2019

53.3% proc’d: 630 (71 Gflop/s) errors: 0 temps: 57 C
Summary at: sam. févr. 23 20:43:05 CET 2019

65.0% proc’d: 765 (71 Gflop/s) errors: 0 temps: 58 C
Summary at: sam. févr. 23 20:43:40 CET 2019

76.0% proc’d: 945 (71 Gflop/s) errors: 0 temps: 59 C
Summary at: sam. févr. 23 20:44:13 CET 2019

86.7% proc’d: 1035 (71 Gflop/s) errors: 0 temps: 59 C
Summary at: sam. févr. 23 20:44:45 CET 2019

98.0% proc’d: 1215 (71 Gflop/s) errors: 0 temps: 60 C
Summary at: sam. févr. 23 20:45:19 CET 2019

100.0% proc’d: 1260 (71 Gflop/s) errors: 0 temps: 60 C
Killing processes… done

Tested 1 GPUs:
GPU 0: OK

Looks like Arch only provides the OCL version of cuda-memtest, and it’s broken:
https://aur.archlinux.org/packages/cuda_memtest/
So you would have to manually install cuda and cuda-memtest and use cuda_memtest instead of ocl_memtest.
The results of gpu-burn look good though so I don’t know if cuda_memtest would bring up any new info.
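
For longer gpu-burn runs, reading every progress line by eye gets tedious. Below is a rough Python sketch that pulls the error count and temperature out of each progress line and reports the worst values seen. It assumes single-GPU output in the same format as the run above; `summarize_gpu_burn` is a hypothetical helper, not part of gpu-burn.

```python
import re

# Matches single-GPU progress lines such as:
#   "11.0% proc’d: 135 (71 Gflop/s) errors: 0 temps: 46 C"
# ".*?" does not cross newlines, so each match stays within one line.
PROGRESS = re.compile(r"([\d.]+)%.*?errors:\s*(\d+).*?temps:\s*(\d+)\s*C")


def summarize_gpu_burn(output):
    """Summarize captured gpu_burn output (single GPU assumed).

    Returns the number of progress lines found, the largest error count
    reported on any line, and the highest temperature in degrees Celsius.
    """
    samples = [(int(e), int(t)) for _pct, e, t in PROGRESS.findall(output)]
    if not samples:
        raise ValueError("no gpu_burn progress lines found")
    return {
        "samples": len(samples),
        "max_errors": max(e for e, _ in samples),
        "max_temp_c": max(t for _, t in samples),
    }
```

A nonzero `max_errors` would point at the GPU or its memory; for the run posted above every line reports 0 errors.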

One more crash.
This time, it took a few seconds before the system froze. I saw the CPU usage curve climb to 100%. I’m not sure this crash is related to the driver…
Here are the files I collected:
nvidia-bug-report.log.gz (1.09 MB)
crash_0306_Xorg.log (31.7 KB)

The gpu/driver wasn’t involved in this crash.
Given that the GPU was involved in the previous crashes, but differently each time, and that the gpu-burn test ran fine, I’d suspect a hardware issue other than the GPU. Maybe a subtle system memory fault, a failing PSU, or even the hard drive. I don’t know, it’s very hard to say. It will probably get worse until the faulty part breaks completely, so you’ll know by then.

Once again… and this time it was related to the GPU. The screen froze with strange pixels.

crash_0307_Xorg.log (30.7 KB)
nvidia-bug-report.log.gz (1.09 MB)

Hello,

I have gone through the bug report attached in comment #14 and observed that you are getting Xid error code 62.
I would like to reproduce the issue internally, so I need detailed steps to reproduce it.
Please also provide the dmidecode output.

Hello,

I haven’t identified any particular way to reproduce this issue.
Anyway, here is the dmidecode output:
dmidecode_output.txt (22.7 KB)

Hi again,

15 days since my last crash…

Here are the logs and a picture of the screen:
nvidia-bug-report.log.gz (1.1 MB)

Hi,

Once more. Exactly the same as the last crash.

nvidia-bug-report.log.gz (1.1 MB)

Another crash today.
(I’ve updated the title of the post with my current NVIDIA driver version.)
nvidia-bug-report.log.gz (1.1 MB)

Hello,

Almost one month since the last crash, but the problem is still there.
nvidia-bug-report.log.gz (1.1 MB)