Trouble with video driver on GeForce GT 710

There is a PC running CentOS 7 in every retail outlet of my company, and each of these PCs has an NVIDIA video card. We have no problems with the old cards and old drivers, but all new sites with the same configuration work badly (the OS configurations are almost identical because they are installed from a kickstart file).

After a restart everything works fine: the PCs play a promotional video over HDMI to a TV using mplayer.

But the server constantly writes messages like these to gdm/0.log:

AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 1 connected from local host ( uid=0 gid=0 pid=7330 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 1 disconnected
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 1 connected from local host ( uid=0 gid=0 pid=7332 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 1 disconnected
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 1 connected from local host ( uid=0 gid=0 pid=7307 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 2 disconnected
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 2 connected from local host ( uid=1003 gid=1003 pid=7342 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 2 disconnected
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 2 connected from local host ( uid=1003 gid=1003 pid=7343 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 2 disconnected
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 2 connected from local host ( uid=1003 gid=1003 pid=7347 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 3 connected from local host ( uid=1003 gid=1003 pid=7350 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 4 connected from local host ( uid=1003 gid=1003 pid=7509 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 4 disconnected
AUDIT: Fri Jun 22 12:00:52 2018: 7321: client 4 connected from local host ( uid=1003 gid=1003 pid=7510 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 675

A few hours later (anywhere from 5 to 24), the HDMI TV shows a “No signal” error.

The gdm log at that time:

(--) NVIDIA(GPU-0): CRT-0: disconnected
(--) NVIDIA(GPU-0): CRT-0: 400.0 MHz maximum pixel clock
(--) NVIDIA(GPU-0):
(--) NVIDIA(GPU-0): DFP-0: disconnected
(--) NVIDIA(GPU-0): DFP-0: Internal TMDS
(--) NVIDIA(GPU-0): DFP-0: 330.0 MHz maximum pixel clock
(--) NVIDIA(GPU-0):
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): connected
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): Internal TMDS
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): 340.0 MHz maximum pixel clock
(--) NVIDIA(GPU-0):
(--) NVIDIA(GPU-0): CRT-0: disconnected
(--) NVIDIA(GPU-0): CRT-0: 400.0 MHz maximum pixel clock
(--) NVIDIA(GPU-0):
(--) NVIDIA(GPU-0): DFP-0: disconnected
(--) NVIDIA(GPU-0): DFP-0: Internal TMDS
(--) NVIDIA(GPU-0): DFP-0: 330.0 MHz maximum pixel clock
(--) NVIDIA(GPU-0):
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): connected
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): Internal TMDS
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): 340.0 MHz maximum pixel clock

And I don’t see any unusual errors in /var/log/messages.

Info about hardware:

[root@lavkaf01 ~]# lspci |grep N
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)

Info about software:

[root@lavkaf01 ~]# uname -a
Linux lavkaf01 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

[root@lavkaf01 ~]# cat /etc/*release
CentOS Linux release 7.5.1804 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.5.1804 (Core)
CentOS Linux release 7.5.1804 (Core)

I am currently using the driver NVIDIA-Linux-x86_64-390.67.run.

With the previous driver version I got this error more often.

Restarting X (init 3 → init 5) fixes the problem for some time.

Please help me solve this problem for good.

Please check if your issue is the same as this:
[url]https://devtalk.nvidia.com/default/topic/1036758/linux/centos-7-xid-error-69-illegal-class-error-driver-issue/[/url]

Maybe it’s a similar problem, but in my case the errors seem different, and the X server starts OK; it fails later.
nvidia-bug-report.log.gz (114 KB)

What errors are displayed when it fails? Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post will reveal a paperclip icon.

I’ve attached it to my previous post.

Ok, the error is

nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000917e:0:0

This is indeed a driver bug. See if you can work around it by using the kernel parameter

nvidia-drm.modeset=1

sudo cat /sys/module/nvidia_drm/parameters/modeset
should return ‘Y’ if done right.
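
For reference, a minimal sketch of that check as a reusable shell function. The function name `modeset_enabled` and its optional path argument are made up for illustration (the argument only exists so the check can be tried against a test file). The `grubby` line in the comment is one common way to add a kernel parameter on CentOS 7; treat it as an assumption to verify on a test machine, not as the only way.

```shell
#!/bin/sh
# Succeeds if nvidia-drm kernel modesetting reports 'Y'.
# $1 (optional, hypothetical helper argument): file to read instead of the
# real sysfs path, useful for dry runs.
modeset_enabled() {
    param_file="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    [ -r "$param_file" ] && [ "$(cat "$param_file")" = "Y" ]
}

# Assumed way to add the parameter to all installed kernels (run as root):
#   grubby --update-kernel=ALL --args="nvidia-drm.modeset=1"
# After a reboot:
#   modeset_enabled && echo "modeset is on" || echo "modeset is off"
```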

Oh my god… Is there another way to fix this? The problem is on remote machines, and if I make a mistake in grub during testing I will have to get in a car and spend several hours on the road. :'(

Of course I will try this if it’s the only way… But I hope there is another way.

Don’t you have a test machine nearby so you can tell if the workaround is valid beforehand?
Instead of using a kernel parameter you can of course also create a file /etc/modprobe.d/99-modeset.conf containing

options nvidia-drm modeset=1

and update the initramfs afterwards (dracut -f in CentOS?).

Today and for a few more days I will be working from home.
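
The modprobe.d approach could be sketched like this. The function name `write_modeset_conf` and its directory argument are hypothetical and only there so the step can be tried in a scratch directory first; the file name and option line are the ones given above.

```shell
#!/bin/sh
# Write the module option file suggested above.
# $1 (optional, hypothetical helper argument): target directory,
# defaulting to /etc/modprobe.d.
write_modeset_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'options nvidia-drm modeset=1\n' > "$conf_dir/99-modeset.conf"
}

# Assumed usage as root, followed by regenerating the initramfs for the
# running kernel and a reboot:
#   write_modeset_conf && dracut -f
```

Unlike a grub edit, a mistake in this file should not stop the machine from booting, which matters for hosts that are hours away.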

Oh, thanks. It sounds better. :) I’ll try it tomorrow.

Nope :(. It doesn’t work. :(
[nioliz@lavkaf01 ~]$ sudo cat /sys/module/nvidia_drm/parameters/modeset
Y
[nioliz@lavkaf01 ~]$ sudo grep "Idling display engine timed" /var/log/messages |tail -n 1
Jun 27 07:10:48 [localhost] kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000927d:0:0

Are there any other ideas, apart from migrating all the shops to ATI?

Now that’s really awful.
To get that fixed, can you pinpoint when the issue started (driver/kernel version)? Also, what triggers the bug? From the log:

[84020.197297] rfkill: input handler disabled
[99300.327751] rfkill: input handler enabled
[99302.330689] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000917e:0:0

It is unrelated to rfkill, but it is a sign that something is happening. Is the TV being turned off?
As an interim short-term solution, does the legacy 340 driver work?
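
To look for a trigger like the rfkill lines above on other machines, something along these lines might help. It is only a sketch: the function name `events_before_error` and the saved-log workflow are assumptions, and it relies on `grep -B`, which GNU grep on CentOS 7 supports.

```shell
#!/bin/sh
# Print each "Idling display engine timed out" line together with the
# lines that precede it in a saved kernel log.
# $1: log file (for example the output of dmesg saved to a file)
# $2: number of preceding lines to show (default 5)
events_before_error() {
    log_file="$1"
    lines="${2:-5}"
    grep -B "$lines" -F 'Idling display engine timed out' "$log_file"
}

# Assumed usage:
#   dmesg > /tmp/dmesg.txt && events_before_error /tmp/dmesg.txt 10
```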

The issue started when we began buying NVIDIA Corporation GK208B [GeForce GT 710] (rev a1) cards. There were no problems with NVIDIA Corporation GK208 [GeForce GT 710B] (rev a1).

I tried updating the OS and drivers on a PC with a GK208, and there seems to be no problem on it. But all PCs with a GK208B get the error.

gdm log:

AUDIT: Mon Jul  2 12:29:06 2018: 27525: client 21 disconnected
nvLock: client timed out, taking the lock
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): connected
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): Internal TMDS
(--) NVIDIA(GPU-0): RAD BBK TV (DFP-1): 340.0 MHz maximum pixel clock

As I said before, there are no interesting messages in /var/log/messages:

Jul  1 12:29:47 [localhost] dhcpd: uid lease 192.168.44.18 for client b0:4e:26:9f:59:f0 is duplicate on 192.168.44.0/27
Jul  1 12:29:47 [localhost] dhcpd: DHCPREQUEST for 192.168.44.5 from b0:4e:26:9f:59:f0 via enp4s0
Jul  1 12:29:47 [localhost] dhcpd: DHCPACK on 192.168.44.5 to b0:4e:26:9f:59:f0 via enp4s0
Jul  1 12:29:56 [localhost] dhcpd: DHCPREQUEST for 192.168.44.27 from b0:f1:ec:be:5c:ee via enp4s0
Jul  1 12:29:56 [localhost] dhcpd: DHCPACK on 192.168.44.27 to b0:f1:ec:be:5c:ee via enp4s0
Jul  1 12:30:01 [localhost] systemd: Created slice User Slice of root.
Jul  1 12:30:01 [localhost] systemd: Starting User Slice of root.
Jul  1 12:30:01 [localhost] systemd: Started Session 1918 of user root.
Jul  1 12:30:01 [localhost] systemd: Starting Session 1918 of user root.
Jul  1 12:30:01 [localhost] systemd: Started Session 1919 of user root.
Jul  1 12:30:01 [localhost] systemd: Starting Session 1919 of user root.
Jul  1 12:30:01 [localhost] systemd: Removed slice User Slice of root.
Jul  1 12:30:01 [localhost] systemd: Stopping User Slice of root.
Jul  1 12:30:10 [localhost] dhcpd: Wrote 0 deleted host decls to leases file.
Jul  1 12:30:10 [localhost] dhcpd: Wrote 0 new dynamic host decls to leases file.
Jul  1 12:30:10 [localhost] dhcpd: Wrote 14 leases to leases file.
Jul  1 12:30:10 [localhost] dhcpd: Unable to add forward map from sp02005008514.palf1.local to 192.168.44.25: not found
Jul  1 12:34:48 [localhost] dhcpd: uid lease 192.168.44.18 for client b0:4e:26:9f:59:f0 is duplicate on 192.168.44.0/27
Jul  1 12:35:10 [localhost] dhcpd: Unable to add forward map from sp02005008514.palf1.local to 192.168.44.25: not found
Jul  1 12:39:49 [localhost] dhcpd: uid lease 192.168.44.18 for client b0:4e:26:9f:59:f0 is duplicate on 192.168.44.0/27
Jul  1 12:40:10 [localhost] dhcpd: DHCPREQUEST for 192.168.44.25 from b8:27:eb:14:7a:f4 (sp02005008514) via enp4s0
Jul  1 12:40:10 [localhost] dhcpd: DHCPACK on 192.168.44.25 to b8:27:eb:14:7a:f4 (sp02005008514) via enp4s0
Jul  1 12:40:10 [localhost] dhcpd: Unable to add forward map from sp02005008514.palf1.local to 192.168.44.25: not found
Jul  1 12:43:30 [localhost] dhcpd: DHCPREQUEST for 192.168.44.2 from 00:e0:c5:47:f1:01 via enp4s0
Jul  1 12:43:30 [localhost] dhcpd: DHCPACK on 192.168.44.2 to 00:e0:c5:47:f1:01 via enp4s0

Houston, we have a problem.

The same kind of error now appears on other video cards after updating the kernel and video driver. I now have 35 shops with a potential problem, so it has become a really big problem.

The driver version is “NVIDIA-Linux-x86_64-340.107.run”.

[root@lavka21 ~]# lspci |grep NV
01:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)

And logs:

AUDIT: Thu Jul  5 10:05:53 2018: 7051: client 21 connected from local host ( uid=1001 gid=1001 pid=7502 )
  Auth name: MIT-MAGIC-COOKIE-1 ID: 632
AUDIT: Thu Jul  5 10:05:53 2018: 7051: client 21 disconnected
(II) NVIDIA(GPU-0): Display (XXX AAA (DFP-1)) does not support NVIDIA 3D Vision
(II) NVIDIA(GPU-0):     stereo.
(**) NVIDIA(0): Using HorizSync/VertRefresh ranges from the EDID for display
(**) NVIDIA(0):     device XXX AAA (DFP-1) (Using EDID frequencies has been
(**) NVIDIA(0):     enabled on all display devices.)
(WW) NVIDIA(GPU-0): The EDID for XXX AAA (DFP-1) contradicts itself: mode
(WW) NVIDIA(GPU-0):     "1920x1080" is specified in the EDID; however, the EDID's
(WW) NVIDIA(GPU-0):     valid HorizSync range (30.000-80.000 kHz) would exclude
(WW) NVIDIA(GPU-0):     this mode's HorizSync (28.1 kHz); ignoring HorizSync check
(WW) NVIDIA(GPU-0):     for mode "1920x1080".
(WW) NVIDIA(GPU-0): The EDID for XXX AAA (DFP-1) contradicts itself: mode
(WW) NVIDIA(GPU-0):     "720x480" is specified in the EDID; however, the EDID's
(WW) NVIDIA(GPU-0):     valid HorizSync range (30.000-80.000 kHz) would exclude
(WW) NVIDIA(GPU-0):     this mode's HorizSync (15.7 kHz); ignoring HorizSync check
(WW) NVIDIA(GPU-0):     for mode "720x480".

I wrote a script that restarts X when it sees an error in the logs, but I hope that is only a temporary solution. Does anybody know whether I can expect an official fix for this problem in upcoming driver releases?

Where should I write to get NVIDIA to work on a fix?
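
For anyone needing a similar stopgap, here is a minimal sketch of such a log watcher. It is not the script mentioned above, which was never posted; the function names, the state file and the `systemctl restart gdm` call are all assumptions. It counts matching lines in a log file and reports whether the count has grown since the previous run.

```shell
#!/bin/sh
# Hypothetical watchdog helpers, not the poster's actual script.
ERROR_PATTERN='Idling display engine timed out'

# Print how many lines in log file $1 contain the error (0 if none).
count_modeset_errors() {
    n=$(grep -c -F -- "$ERROR_PATTERN" "$1" 2>/dev/null || true)
    echo "${n:-0}"
}

# Succeed if log file $1 has more error lines than recorded in state
# file $2 on the previous run; always store the current count.
new_errors_seen() {
    current=$(count_modeset_errors "$1")
    previous=0
    if [ -r "$2" ]; then
        previous=$(cat "$2")
    fi
    echo "$current" > "$2"
    [ "$current" -gt "$previous" ]
}

# Assumed cron usage as root (the state file path is made up):
#   new_errors_seen /var/log/messages /var/run/nvidia-watchdog.count \
#       && systemctl restart gdm
```

Comparing counts rather than just grepping avoids restarting X again on every run for an old error; after log rotation the count drops and the stored value is simply reset.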

While those messages sound terrifying, they’re harmless and just suppressed in newer driver versions. They’re now only shown when modedebug is turned on.

After this error appears, the screen (TV) just turns off and no longer shows the promotional video.

Ok, but then they’re a symptom rather than the cause. It looks like the driver detects a mode switch and rereads the EDID. Anything in dmesg?

This might be relevant for your issue:
[url]https://devtalk.nvidia.com/default/topic/1037255/linux/mythtv-errors-lockups-with-vdpau-and-340-or-390-drivers-/[/url]
though there’s a GK208 involved. It seems to be connected to video hardware acceleration. Since you’re using the card for video, can you confirm whether you’re using acceleration, and which kind?

Yep, it sounds similar to my problem.

The latest video cards we use are these:
https://www.citilink.ru/catalog/computers_and_notebooks/parts/videocards/1031424/
(ASUS nVidia GeForce 210, EN210 SILENT/DI/1GD3/V2(LP), 1 GB, DDR3)

Before that we bought these:
https://www.citilink.ru/catalog/computers_and_notebooks/parts/videocards/601762/
(SAPPHIRE AMD Radeon R5 230, 11233-01-10G, 1 GB, DDR3)

Nioliz, I am the one who created the post that generix linked. I think what he was asking was whether you are using hardware acceleration for video playback. I primarily play back MPEG-2 videos using VDPAU. Do you know what hardware acceleration you are using?

Also, is the hardware the same on all of your PCs? Mine for example is:
Motherboard: ASUS M5A78L-M LX3 (760G chipset)
CPU: AMD Athlon X2 Dual-Core Processor 270 (3.4 GHz) AM3
RAM: G.SKILL Ripjaws Series 8GB (2 x 4GB) 240-Pin DDR3 1333 (PC3 10666) Model F3-10666CL9D-8GBRL
Video: GT 710 1GD3H LPV1
OS: 64-bit Arch Linux

I am playing both audio and video out of the GT 710’s HDMI port, and the video is 1920x1080. I am using the schedutil CPU frequency governor.

Are there any commonalities between this and your systems?
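
On the hardware acceleration question, one rough way to check is to look at which video output mplayer was started with. The sketch below only inspects an explicit `-vo` option on the command line; the function name `vo_from_args` is made up, and it will not see a video output chosen through mplayer’s config file or its defaults, so an empty result is not proof that acceleration is off.

```shell
#!/bin/sh
# Print the value following "-vo" in the given argument list, if any.
# Returns non-zero when no explicit -vo option is present.
vo_from_args() {
    while [ $# -gt 1 ]; do
        if [ "$1" = "-vo" ]; then
            printf '%s\n' "$2"
            return 0
        fi
        shift
    done
    return 1
}

# Assumed usage against a running player (pgrep -a prints the PID and
# the full command line; left unquoted on purpose so it splits into words):
#   vo_from_args $(pgrep -a mplayer | head -n 1)
```

A result of `vdpau` would suggest the same VDPAU path as in the linked MythTV thread.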

The problem doesn’t exist on Ubuntu, but it is still present on CentOS 7 with the latest OS and video driver updates.