RTX 2080 Ti doesn't work for me on Fedora 30

I have three NVIDIA cards I’ve been using in my system: a P2000, a GTX 1080 Ti, and most recently an RTX 2080 Ti. The P2000 and GTX 1080 Ti (Pascal architecture) work fine. The system is a Gigabyte MW51-HP0 motherboard with a Xeon W-class processor. The RTX card freezes my system. As a test, I booted into Windows 10 and played Assassin’s Creed Odyssey for about an hour, and the system worked fine: no glitches, no hang-ups, nothing. I assume playing a video game pushes the limits of the power supply, motherboard, memory, etc., so if there were some kind of hardware problem with the card, it would show up while gaming. The GPU was running at 95%.

Anyway, I then booted into Linux (Fedora 30), and after starting Firefox, going to YouTube, and playing a video, the system hung within about 5 minutes. The X server freezes and I have to do a hard reset. Other times I can ssh in from my laptop after the desktop freezes, and the Xorg server is running at 100% CPU as if it’s stuck in an infinite loop. Other times I log in and there is a task with the string ‘irq’ in its name running at 100% CPU. And then, as I said, there are times when the system freezes so hard that I can’t ssh in to see what’s going on.
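When I can ssh in, a quick batch-mode run of top is enough to show the stuck process (ordinary top usage; batch mode sorts by %CPU by default):
top -b -n 1 | head -n 15   # Xorg or the irq thread pinned near 100% while the desktop is hung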

Has anyone had these kinds of problems with RTX cards? I don’t know if this is a Fedora 30-specific problem, since I upgraded to Fedora 30 just before trying out the card. The GTX and P2000 cards work just fine. Also, the RTX hangs seem to be correlated with playing video in Firefox, so perhaps it’s the video decoding that is tripping up the Xorg server?

This RTX problem has occurred with the 418.56, 418.74, 430.09, and 430.14 drivers.

Any help is appreciated.
nvidia-bug-report.log.gz (1.13 MB)

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
[url]https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/[/url]
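For reference, the typical invocation looks like this (the script ships with the driver and writes its archive to the current directory):
sudo nvidia-bug-report.sh
ls nvidia-bug-report.log.gz   # this is the file to attach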

I’ve attached an nvidia-bug-report.log.gz

The only thing is that the bug report was generated while the system was running normally. I’ll try to get the Xorg server to hang and then generate another bug report…

OK, I ran a GPU-intensive test and got the system to hang. Basically, I played three 4K YouTube videos, one on each of my monitors, then fired up my Windows 10 VMware client full-screen across the three monitors on a different desktop. Then I fired up my Linux guest, but didn’t go into full-screen mode. Then I closed two of the three 4K videos along with their respective Firefox windows. At that point the Xorg server hung. I logged into the system from my laptop and generated the following bug report.

nvidia-bug-report.log.gz (628 KB)

Before I did this test, I did another set of tests which hung my computer right away. In one, I logged in and the system hung almost immediately. In another, the system hung at the GDM login screen. I couldn’t log in to generate a bug report because the sshd service was not running; I didn’t know this at the time and thought the system was simply hung hard. In any case, I then thought maybe I did have a hardware problem, so I booted into Windows 10 and played Assassin’s Creed Odyssey for about 15 minutes, and the system didn’t hiccup. So I figured that if the system hung under Linux in under a minute just trying to log in, but didn’t hang after 15 minutes of a graphics-intensive video game on Windows, then the problem isn’t hardware.

I then booted back into my Linux environment, figured out that I had not enabled the sshd service, enabled and started it (see the commands below), then started the test in the previous message. For some reason the system seemed more stable after starting the sshd service, but I think that’s just a coincidence.
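For anyone following along, enabling and starting the service is one step on Fedora (standard systemd usage):
sudo systemctl enable --now sshd   # start now and at every boot
systemctl status sshd              # confirm it shows active (running)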

I hope someone can take a look at the log files.

The log is flooded with PCIe bus errors:

[14214.857734] {176}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[14214.857737] {176}[Hardware Error]: It has been corrected by h/w and requires no further action
[14214.857739] {176}[Hardware Error]: event severity: corrected
[14214.857741] {176}[Hardware Error]:  Error 0, type: corrected
[14214.857742] {176}[Hardware Error]:   section_type: PCIe error
[14214.857744] {176}[Hardware Error]:   port_type: 5, upstream switch port
[14214.857745] {176}[Hardware Error]:   version: 3.0
[14214.857747] {176}[Hardware Error]:   command: 0x0147, status: 0x0010
[14214.857748] {176}[Hardware Error]:   device_id: 0000:b3:00.0
[14214.857749] {176}[Hardware Error]:   slot: 0
[14214.857750] {176}[Hardware Error]:   secondary_bus: 0xb4
[14214.857752] {176}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x8747
[14214.857753] {176}[Hardware Error]:   class_code: 000406
[14214.857754] {176}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x001b
[14214.857855] pcieport 0000:b3:00.0: aer_status: 0x0000b1c1, aer_mask: 0x00002000
[14214.857858] pcieport 0000:b3:00.0:    [ 0] RxErr                 
[14214.857860] pcieport 0000:b3:00.0:    [ 6] BadTLP                
[14214.857862] pcieport 0000:b3:00.0:    [ 7] BadDLLP               
[14214.857864] pcieport 0000:b3:00.0:    [ 8] Rollover              
[14214.857866] pcieport 0000:b3:00.0:    [12] Timeout               
[14214.857868] pcieport 0000:b3:00.0:    [15] HeaderOF              
[14214.857870] pcieport 0000:b3:00.0: aer_layer=Physical Layer, aer_agent=Transmitter ID

So this rather looks like a hardware error.
The fact that it doesn’t crash under Windows doesn’t necessarily mean the hardware is working properly; Windows might limit the PCIe bus to Gen1 or Gen2 speeds on the first error. You should check the Windows event log.
First things to try:

  • reseat the card in the slot, maybe even multiple times in case of dirt on the slot contacts
  • check for a BIOS update

If that doesn’t resolve the issue, check whether the card works properly in another system.
As a workaround, you can test whether limiting PCIe to Gen2 speeds in the BIOS results in a stable system; afterwards you can watch whether the errors return (see the snippet below).
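To watch whether the corrected PCIe errors keep coming back while you test, plain dmesg filtering is enough (nothing driver-specific assumed):
sudo dmesg -w | grep -iE 'hardware error|BadTLP|BadDLLP|aer'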

I want to thank generix for pointing out the hardware error in the logs; I assume that was from dmesg. In any case, I’ve updated the BIOS and the system seems to be working better. But… I still got a hang after a while. It occurred when I fired up my second Windows 10 VMware client. I had it in full-screen mode, watching a 4K YouTube video. When it froze, I ran dmesg --follow and saw something about the GPU and the kernel trying to do something. Hopefully it was captured in the nvidia bug report. I didn’t see any hardware errors though, so I’m hoping the BIOS update got rid of them.

nvidia-bug-report.log.gz (709 KB)

Seems the BIOS update fixed the PCIe bus errors.
Ultimately, the GPU is running into an XID 61, which isn’t helpful; it might be just about anything.
Please create a clocking and temperature log until crash:
nvidia-smi -l 2 -q -d TEMPERATURE,CLOCK -f nvtemp.log
to see if unusual values appear.
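After a crash, you can skim the tail of that log for the readings just before the hang; the field name below comes from nvidia-smi -q output and may differ slightly between driver versions:
grep 'GPU Current Temp' nvtemp.log | tail -n 20   # temperatures leading up to the crash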

Hi Generix,

I’ve rebooted and I’m now running your nvidia-smi command. The system has been up for several hours now. The last hang occurred while I was running the VMware client with Windows 10 in full-screen mode. VMware Workstation Pro 15 (which is what I’m running) uses the GPU of the video card to drive its clients’ displays. So I wonder whether, when I was running the 4K YouTube video in full-screen mode, what tripped up the video card was VMware’s interaction with the GL libraries. In any case, the system is still running right now…
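As a side check on which GL stack the host is actually using, I can run glxinfo (from Fedora’s glx-utils package; just a spot check, not conclusive):
glxinfo | grep 'OpenGL renderer'   # should name the NVIDIA GPU, not a software renderer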

The system hung again; now it looks like an IRQ issue.

Tasks: 430 total,   4 running, 426 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.4 us, 28.0 sy,  0.0 ni, 66.3 id,  0.0 wa,  0.1 hi,  0.1 si,  0.0 st
MiB Mem :  64114.4 total,  21152.3 free,   5067.5 used,  37894.7 buff/cache
MiB Swap:  33791.0 total,  33791.0 free,      0.0 used.  45491.3 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 6909 adler     20   0   11.9g   8.9g   8.2g R 293.7  14.2 122:35.60 vmware-vmx
32129 adler     20   0 8484524   4.8g   4.3g S 141.9   7.6 104:11.42 vmware-vmx
 4010 root     -51   0       0      0      0 R  99.7   0.0   7:04.39 irq/108-n+
 8197 adler     20   0 2331696 480252 150372 S   1.0   0.7   5:03.61 Web Conte+
26078 adler     20   0  648924  50076  34408 S   1.0   0.1   0:10.24 gnome-ter+
 4516 adler     20   0 4714216 826532 118320 S   0.7   1.3   6:49.11 gnome-she+
 7996 adler     20   0 2149780 493184 215332 S   0.7   0.8   4:35.67 Web Conte+
  773 root      20   0       0      0      0 S   0.3   0.0   0:04.31 md1_raid5
  961 root      20   0  120616  80220  77780 S   0.3   0.1   0:02.58 systemd-j+
 4840 adler     20   0 2833588  35940  25344 S   0.3   0.1   1:17.28 cpufreq-s+
 6111 root      20   0       0      0      0 I   0.3   0.0   0:00.25 kworker/u+
 7911 adler     20   0 2990540 722304 356280 S   0.3   1.1   4:26.36 firefox
 9470 root      20   0  227364   4756   3720 R   0.3   0.0   0:00.03 top
22202 adler     20   0 1845224 348508 242620 S   0.3   0.5   0:45.03 Web Conte+
26580 root      20   0  215412    992    880 S   0.3   0.0   0:00.75 dmesg
    1 root      20   0  171988  15312   9648 S   0.0   0.0   0:02.98 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.01 kthreadd

Notice the irq/108-n+ task is at 99.7% CPU.
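To confirm that the truncated irq/108-n+ name belongs to the NVIDIA driver’s threaded interrupt handler, I can look up IRQ 108 (the number comes from the top output above):
awk '$1 == "108:"' /proc/interrupts   # last field names the owning driver
ps -eo pid,comm | grep irq/108        # full name of the threaded IRQ handler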

What follows is the output of dmesg. I’m running the nvidia bug report tool, but since dmesg is just spewing out messages, it’s taking a very long time to generate. I may have to kill it.

[11877.516389] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.521810] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.527343] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.533138] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.538433] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.543921] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.549434] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.555196] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.560499] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.565919] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.573539] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.579367] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000
[11877.584707] NVRM: Xid (PCI:0000:b6:00): 32, Channel ID 00000019 intr0 00000000 intr1 80000000

If I can get a bug report log file generated I’ll upload it.

One thing I noticed from your last logs: the GPU was idling in an error state, yet the GPU temperature was at 59°C, which is bad. The 2080 Ti is very touchy when it comes to heat; better check if something is blocking airflow.
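For a quick spot check of idle temperature and clocks, something like this works (standard nvidia-smi query fields, sampling every 5 seconds):
nvidia-smi --query-gpu=temperature.gpu,clocks.sm,clocks.mem --format=csv -l 5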

Also, what caused it to hang this time was exiting full-screen mode in my Windows 10 VMware client. When I clicked on the “exit full-screen mode” icon, the screen froze. Before that, I had been trying to get the screen to freeze by watching IGN video reviews in full-screen mode inside the Windows 10 VMware client. I did that for about 15 minutes and it ran fine… until I stopped and tried to exit full-screen mode.

Now, XID 32 is something completely different; that would rather point to faulty system memory.

Maybe use gpu-burn for 10 minutes to check the GPU/temperatures.
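If you want to try it, a sketch of the usual build-and-run steps, assuming the commonly used wilicc/gpu-burn repository and a CUDA toolkit in its default location:
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
./gpu_burn 600   # stress the GPU for 10 minutes while watching temperatures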

Hi Generix, I wasn’t able to generate the bug report log file. One thing I noticed was that nvidia-smi hung. I wonder if the command extracts info from the card through the driver, and with the driver stuck in an infinite loop servicing IRQs, both nvidia-smi and the bug report tool end up hanging. In any case, I’ve attached the nvtemp.log.gz file.
nvtemp.log.gz (89.7 KB)
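Next time I’ll guard against that hang by wrapping the query in a timeout (timeout is from coreutils; the 15 seconds is an arbitrary choice):
timeout 15 nvidia-smi -q || echo 'nvidia-smi did not return; driver likely wedged'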

Does gpu-burn require the full CUDA development system to build and run? Is there an rpm package I can use to install it? Should I run it from Windows 10? (My system is dual-boot.)

[  643.286314] NVRM: GPU at PCI:0000:b6:00: GPU-03d24ee9-7ae0-dea7-f5eb-3f50c2acd242
[  643.286317] NVRM: GPU Board Serial Number:
[  643.286318] NVRM: Xid (PCI:0000:b6:00): 61, 0cb5(2d50) 00000000 00000000
[  653.126911] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  655.126894] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  667.718830] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  669.718822] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  680.006800] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  682.006802] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[  692.294779] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Just hung again with XID 61, this time only 692 seconds after boot. Clicking around Firefox to download gpu-burn caused it to hang… Do you need the bug report the next time I get this error? I’m going to reboot.

I have another RTX 2080 Ti. I’m going to swap out the cards to see if it’s a card-specific problem.

No further bug report needed; the error is mostly consistent: XID 61, with the one-time exception of XID 32.
There doesn’t seem to be an rpm for gpu-burn, and building it might be a big hassle for you: Windows 10 would require installing the full build chain, and Fedora 30’s system compiler is GCC 9 while CUDA 10.1 AFAIK only supports up to GCC 8.

What if I run a video game for a few hours instead?

I swapped RTX 2080 Ti cards, and I got the XID 61 error within seconds of logging in. So I guess it’s not the card. My desktop setup is three Dell 4K 32" monitors with DP 1.4 interfaces. I’m currently running with only one monitor on to see if that makes a difference.

I’ll poke around for a gpu-burn-like utility; it looks like there may be a few out there. I have a RHEL 8 system and a RHEL 7 system on which I may be able to build gpu-burn and then copy the executable over to my Fedora 30 desktop. But will that need the CUDA libraries on Fedora 30?
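If I do copy the binary over, plain ldd on the target machine should show which shared libraries it expects (no CUDA-specific tooling assumed):
ldd ./gpu_burn | grep -Ei 'cuda|not found'   # any 'not found' entries must be installed or shipped alongside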

Yes, running gpu-burn requires having the cuda-toolkit installed.