"The NVIDIA X driver has encountered an error; attempting to recover..." in Xorg.0.log and Xorg instability: is this a hardware or software issue?

My Xorg.0.log is full of errors like this:

[    37.627] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00002390, 0x00003620)
[    37.856] (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
[    37.856] (EE) NVIDIA(0):     recover...

And dmesg seems to contain more information:

[   38.010652] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 507
[   38.020849] nvgpu: 57000000.gpu     gk20a_fifo_handle_sched_error:2531 [ERR]  fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
...
[   38.474067] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[   54.990358] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1726 [ERR]  gr_status_r : 0x81
[   55.000554] nvgpu: 57000000.gpu                    fifo_error_isr:2605 [ERR]  channel reset initiated from fifo_error_isr; intr=0x00000100
[   98.815590] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 507
[   98.825793] nvgpu: 57000000.gpu     gk20a_fifo_handle_sched_error:2531 [ERR]  fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
[   98.837968] ---- mlocks ----
...

I have attached full logs: dmesg.log (125.3 KB) Xorg.0.log (15.5 KB)
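For anyone triaging a similar log, here is a minimal sketch that tallies the nvgpu [ERR] lines by the kernel function that reported them. The regular expression is an assumption based only on the line format shown in the excerpt above, and the names NVGPU_ERR and count_nvgpu_errors are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the dmesg excerpt in this post:
# [   38.010652] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 507
NVGPU_ERR = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+\S+\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<msg>.*)$"
)


def count_nvgpu_errors(lines):
    """Return a Counter mapping reporting function name -> number of [ERR] lines."""
    counts = Counter()
    for line in lines:
        m = NVGPU_ERR.match(line.strip())
        if m:
            counts[m.group("func")] += 1
    return counts
```

Passing an open file works, for example count_nvgpu_errors(open("dmesg.log")), and calling .most_common() on the result lists the most frequent reporters first.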

I have a 2560x1440 monitor attached (actually it is 1440x2560, rotated with xrandr). I used the suggestion from an NVIDIA employee in the thread "1440x2560 HDMI display not working" to fix hdmi2.0.c and recompiled the kernel; I did not make any other changes to the kernel.

After a while, Xorg freezes and uses about 100% CPU. Any ideas whether this is a hardware or software issue? The monitor works fine with the NVIDIA card in my PC (RTX 2060 SUPER 8GB) and even with a Raspberry Pi, so I guess the monitor’s hardware is OK.

Hi,

It is a known issue in the rel-32.4.2 release. We are still working on it.

For now, you could try rel-32.3.1 to avoid this error.
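To confirm which L4T release a board is currently running before switching, one option is to read /etc/nv_tegra_release. The sketch below assumes the first line of that file looks like "# R32 (release), REVISION: 4.2, ..."; the function names are hypothetical, and the file describes the installed BSP packages, so a separately rebuilt kernel may not be reflected in it.

```python
import re

# Assumed format of the first line of /etc/nv_tegra_release, e.g.:
# "# R32 (release), REVISION: 4.2, GCID: ..., BOARD: t210ref, ..."
RELEASE_RE = re.compile(
    r"R(?P<major>\d+)\s+\(release\),\s+REVISION:\s+(?P<rev>[\d.]+)"
)


def parse_l4t_release(text):
    """Return the L4T version as a string like '32.4.2', or None if not recognized."""
    m = RELEASE_RE.search(text)
    if not m:
        return None
    return f"{m.group('major')}.{m.group('rev')}"


def read_l4t_release(path="/etc/nv_tegra_release"):
    """Read the release file and return the parsed version, or None if unreadable."""
    try:
        with open(path) as f:
            return parse_l4t_release(f.readline())
    except OSError:
        return None
```

Calling read_l4t_release() on the device returns None rather than raising if the file is missing or has an unexpected layout.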

Thank you very much. I downloaded https://developer.nvidia.com/embedded/dlc/r32-3-1_Release_v1.0/Sources/T210/public_sources.tbz2, and after I compiled and installed the kernel from the 32.3.1 release, the issue is gone. X seems to be stable, with no GPU errors so far.

Hi WayneWWW,

I am seeing the above error as well, but would like to continue using the 32.4.2 release as I am now depending on some other fixes present in this release. Do you have any further updates on this issue?

Thanks,

Chris Richardson

Hi,

Could you share a full dmesg log?

Also, please tell us how to reproduce this issue.

Hey WayneWWW,

Sorry to respond so late to this post. I didn’t receive a notification for it for some reason. To reproduce it I just flash my Nano eMMC-based module with L4T 32.4.2 and the errors show up on the serial console. The errors repeat regularly if that matters. This did not happen on the SD card based module with 32.4.2.
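Since the errors repeat regularly, it may help to measure the period. This is a rough sketch that extracts the kernel timestamps of the "fifo sched ctxsw timeout error" lines and returns the gaps between them; the pattern is an assumption based on the dmesg lines quoted in the first post, and ctxsw_timeout_intervals is a hypothetical helper name.

```python
import re

# Captures the kernel timestamp of lines containing the ctxsw timeout message,
# assuming the "[ seconds.micros] ... fifo sched ctxsw timeout error" layout
# seen in the first post.
CTXSW_RE = re.compile(r"^\[\s*(\d+\.\d+)\].*fifo sched ctxsw timeout error")


def ctxsw_timeout_intervals(lines):
    """Return the gaps in seconds between consecutive ctxsw timeout errors."""
    stamps = []
    for line in lines:
        m = CTXSW_RE.match(line.strip())
        if m:
            stamps.append(float(m.group(1)))
    # Pair each timestamp with the next one and take the difference.
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

On the two timeout lines quoted in the first post (38.020849 and 98.825793), this gives a single gap of roughly 60.8 seconds. Whether the eMMC module shows the same period would have to be checked against its own log.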

DmesgLog_2020-06-16_01_UbuntuDisplayProblem01.txt (55.8 KB)

Thanks,

Chris Richardson

Yes, this is a known issue that only happens on eMMC-based modules.