On a fresh install of Ubuntu 24, I can’t open any application without the system freezing (no audio output, can’t switch TTY) and forcing a hard reset, and I get an Xid 79 (“GPU has fallen off the bus”) error. Attached is the NVIDIA bug report log.
nvidia-bug-report.log.gz (419.9 KB)
Furthermore, Windows seems to work fine in testing, although I do get an occasional freeze similar to the one I constantly get on Ubuntu. Event Viewer only shows “Event 41 Unexpected Shutdown”. Does this GPU require an RMA?
In the words of an NV eng:
Not related to this topic, but I also have an Xid issue, though somehow only when Intel XeSS is enabled. Can’t reproduce this on Windows, or with DLSS on either platform, though.
Not sure what to blame at this point tbh.
Oddly enough, it doesn’t follow a pattern of happening under heavy load. On Linux, it happens within a few minutes of booting, usually right after I launch an application such as Firefox. On Windows, it happens spontaneously. I highly doubt it’s a PSU issue, since I was able to run a ComfyUI workflow for hours on end on Windows.
In my case it was a broken motherboard.
@hunter.han if it happens only on Linux but not ever on Windows, then it’s probably the “BIOS” option (he should have really written “firmware” more generally). Check if there are any BIOS/UEFI firmware updates for your mobo.
My BIOS is on the latest version. It does happen on Windows, but only very occasionally. After trying another GPU and stress testing it, I am more sure it is not my PSU (the Xid 79 GPU wouldn’t even survive opening the stress-testing software on Ubuntu, let alone starting the test).
The Windows driver may have a lower power cap by default and/or a better cooling profile (just speculation though: I don’t even know what any Windows newer than XP looks like). Anyway, you can try setting your fan speed target to 70% and lowering the max power draw by ~50 W on Linux right after booting, and see if it helps. You can use nvidia-smi for the power cap (--power-limit option) and probably also for the fan speed, but I have never played with that. I have definitely seen fan speed settings in the nvidia-settings GUI, in the Thermal Settings section (you need to run it as root).
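The power-cap suggestion above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the GPU is index 0, the reduction is the ~50 W mentioned in the post, and the helper names (`lowered_limit`, `power_limit_command`, `query_default_limit`) are made up for illustration. The `--power-limit` option and the `power.default_limit` query field do exist in nvidia-smi, and changing the limit normally requires root.

```python
import subprocess

def lowered_limit(default_watts: float, reduction_watts: float = 50.0) -> int:
    """Return the default power limit minus the reduction, in whole watts."""
    return int(default_watts - reduction_watts)

def power_limit_command(watts: int, gpu_index: int = 0) -> list[str]:
    """Build the nvidia-smi command that sets the power cap (run it as root)."""
    return ["nvidia-smi", "-i", str(gpu_index), f"--power-limit={watts}"]

def query_default_limit(gpu_index: int = 0) -> float:
    """Ask nvidia-smi for the card's default power limit in watts."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.default_limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())
```

A possible use is `subprocess.run(power_limit_command(lowered_limit(query_default_limit())), check=True)` from a root shell right after booting. The driver rejects values outside the range the board allows, so the call may fail on some cards.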
Thanks for the recommendations, but that didn’t seem to do much. I also could not change the fan speed, even as root, for some reason. For context, I’m using nvidia-590-open. Just to test, I went into Windows and ran one of my AI workflows while measuring the 16-pin power usage, and it hit a peak of 645 W with no issues.
Fans and their control are provided by the board makers (ASUS, PNY, Zotac etc.), not by Nvidia, so it is possible (though unlikely, I must say) that your card does it in a non-standard way that is handled by the board designer’s software on Windows and prevents the standard Linux driver from controlling the fans. This would explain the situation: on Windows the custom software keeps the card cool, while on Linux it overheats due to the broken control mechanism. Have you monitored the thermals on Linux? What brand and model exactly do you have?
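One way to monitor the thermals on Linux when the machine hard-freezes is to log samples to disk and sync after every line, so the last reading before the freeze survives the reset. The sketch below assumes GPU index 0 and a one-second interval; the function names and the log path are hypothetical. The query fields `temperature.gpu`, `fan.speed` and `power.draw` are existing nvidia-smi fields, though some boards report `[N/A]` for some of them.

```python
import os
import subprocess
import time

FIELDS = ["temperature.gpu", "fan.speed", "power.draw"]

def parse_sample(csv_line: str) -> dict:
    """Parse one 'csv,noheader,nounits' line; unsupported fields become None."""
    values = [v.strip() for v in csv_line.split(",")]
    sample = {}
    for name, raw in zip(FIELDS, values):
        try:
            sample[name] = float(raw)
        except ValueError:
            sample[name] = None  # e.g. "[N/A]" when the board does not report it
    return sample

def read_sample(gpu_index: int = 0) -> dict:
    """Query nvidia-smi once and return the parsed readings."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])

def log_samples(path: str, interval_s: float = 1.0) -> None:
    """Append a timestamped sample per interval, forcing each line to disk."""
    with open(path, "a") as log:
        while True:
            log.write(f"{time.time():.0f} {read_sample()}\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the line survives a hard reset
            time.sleep(interval_s)
```

Starting `log_samples("gpu.log")` in a terminal before opening any application, and reading the tail of the file after the reset, would show whether temperature or power draw was climbing just before the freeze.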
I have a PNY 5090. The thermals never go past 35 degrees Celsius before crashing. That’s interesting, though: I noticed the fans never spin up. However, I wasn’t able to get any GPU benchmark to run before it freezes. I’ll keep trying though.
You may be onto something. On closer inspection, my motherboard is running the GPU at x1 speeds. I suspect this PCIe slot is broken and unable to keep a steady connection with the graphics card.
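For anyone wanting to check the same thing on Linux, the negotiated PCIe link width can be read from sysfs. This is a minimal sketch: the helper names are made up, and the device address in the usage note is a placeholder (the real one can be found with `lspci`). The files `current_link_width` and `max_link_width` are standard sysfs attributes of a PCI device. `max_link_width` is what the card itself supports, so a lower current value can also come from a slot that is simply wired for fewer lanes.

```python
from pathlib import Path

def read_link_width(device_dir: str) -> tuple[int, int]:
    """Return (current, maximum) PCIe link width for a sysfs PCI device directory."""
    base = Path(device_dir)
    current = int((base / "current_link_width").read_text().strip())
    maximum = int((base / "max_link_width").read_text().strip())
    return current, maximum

def is_degraded(current: int, maximum: int) -> bool:
    """True when the link negotiated fewer lanes than the device supports."""
    return current < maximum
```

For example, `read_link_width("/sys/bus/pci/devices/0000:01:00.0")` on a healthy x16 card in a full x16 slot would be expected to return `(16, 16)`, while the situation described above would show a current width of 1.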