(Sorry for my broken English.)
I recently (2 months ago) installed Arch Linux on my computer (it was
previously running Debian unstable). Since then, I have had 4 crashes that seem
related to the NVIDIA driver.
They didn’t happen under the same circumstances, but the symptoms were
the same (strange blocks of pixels on the screen, X hangs, and in 3 of the 4 crashes the whole system
hangs).
Can you please help me solve this problem, or at least find a
workaround?
The errors you ran into according to the logs:
XID 8
XID 31+8
XID 13+8
XID 31+56
It’s really hard to say, since the errors are a bit different each time. Maybe check for faulty video memory using cuda-memtest and gpu-burn.
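If it helps to see how often each code shows up across several crashes, here is a minimal Python sketch that tallies Xid codes from kernel log text. It assumes the driver's messages look like `NVRM: Xid (PCI:0000:01:00): 31, Ch 00000010, ...`; the function name `count_xids` and the regular expression are just illustrative choices, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed message shape: "NVRM: Xid (PCI:0000:01:00): 31, Ch 00000010, ..."
# The number right after the closing parenthesis and colon is the Xid code.
XID_PATTERN = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def count_xids(log_text):
    """Return a Counter mapping each Xid code (int) to its number of occurrences."""
    return Counter(int(m.group(1)) for m in XID_PATTERN.finditer(log_text))
```

You could feed it the text of a saved kernel log (for example the output of `journalctl -k` written to a file) and print the resulting counts to compare crashes.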
$ ocl_memtest
hostname is aragorn
CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.2 CUDA 10.1.113
Device 0 is CL_DEVICE_TYPE_GPU, “GeForce GTX 660”
allocated 1725 Mbytes from device 0
[02/23/2019 20:16:11][aragorn][0]:Test0 [Walking 1 bit]
[02/23/2019 20:16:11][aragorn][0]:Test0: global walk test
ERROR: opencl call failed with rc(-5), line 39, file ocl_tests.cpp
Error: Out of resources
Do I need anything else to be able to run the test?
$ pacman -Qs nvidia
local/cuda_memtest 1.2.3-3
A GPU memory test utility for NVIDIA and AMD GPUs. OpenCL version.
local/lib32-libvdpau 1.1.1-3
Nvidia VDPAU library
local/lib32-nvidia-utils 418.43-1
NVIDIA drivers utilities (32-bit)
local/libvdpau 1.1.1+3+ga21bf7a-1
Nvidia VDPAU library
local/libxnvctrl 418.43-1
NVIDIA NV-CONTROL X extension
local/nvidia-dkms 418.43-2
NVIDIA driver sources for linux
local/nvidia-settings 418.43-1
Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 418.43-1
NVIDIA drivers utilities
local/opencl-nvidia 418.43-1
OpenCL implemention for NVIDIA
Looks like Arch only provides the OCL version of cuda-memtest, and it’s broken:
https://aur.archlinux.org/packages/cuda_memtest/
So you would have to install CUDA and cuda-memtest manually and use cuda_memtest instead of ocl_memtest.
The results of gpu-burn look good, though, so I don’t know whether cuda_memtest would bring up any new info.
One more crash.
This time, it took a few seconds before the system froze. I saw the CPU usage curve climb to 100%. I’m not sure this crash is related to the driver…
Here are the files I collected: nvidia-bug-report.log.gz (1.09 MB), crash_0306_Xorg.log (31.7 KB)
The GPU/driver wasn’t involved in this crash.
Taking into account that the GPU was involved in the previous crashes, but differently each time, and that the gpu-burn test ran fine, I’d suspect a hardware issue, but not the GPU. Maybe some subtle system memory fault, a failing PSU, or even the hard drive. IDK, very hard to say. It’ll probably get worse until the faulty part breaks completely, so you’ll know by then.
I have gone through the bug report attached in comment #14 and observed that you are getting Xid error code 62.
I would like to reproduce the issue internally, so I need detailed steps to reproduce it.
Please also provide the dmidecode output.