I’ve been experiencing issues with my system when running a dual-GPU setup. I get messages like these:
kernel:[ 164.125086] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [irq/111-nvidia:1844]
And I can’t run nvidia-bug-report.sh because it hangs. This only happens when I try to run with both GPUs. I tried each one individually on the system and both seem to work fine (I trained a DL model overnight without issues with either). Also, the very same motherboard used to run two RTX 2080s without any problems.
I’m attaching the dmesg output as an alternative to nvidia-bug-report.sh (I also tried it with --safe-mode, but no luck).
dmesg (23.4 KB)
Processor: AMD Ryzen 5 3600 6-Core Processor
Driver version: 460.84
Please check for a bios update first.
You’re running into an Xid 62, which by itself doesn’t tell us much. Since Ampere-generation GPUs are very susceptible to memory overheating, that might be the cause when using both GPUs (each heating up the other). Unfortunately, it’s not yet possible to monitor memory temperature on Linux, so you can only monitor the GPU core temperature using nvidia-smi.
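If it helps, here is a rough sketch of polling the core temperature of every GPU through nvidia-smi’s CSV query output. The helper names (`parse_gpu_temps`, `read_gpu_temps`) are made up for illustration, and it assumes nvidia-smi is on the PATH. It only reports the core temperature, not the memory temperature.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; all three are standard --query-gpu fields.
QUERY_FIELDS = ["index", "name", "temperature.gpu"]


def parse_gpu_temps(csv_text):
    """Parse 'index, name, temperature.gpu' rows (no header, no units)."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, name, temp = (field.strip() for field in rec)
        if not index.isdigit():
            continue
        rows.append({
            "index": int(index),
            "name": name,
            # nvidia-smi may print a non-numeric placeholder if unsupported
            "temp_c": int(temp) if temp.isdigit() else None,
        })
    return rows


def read_gpu_temps():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_gpu_temps(result.stdout)
```

Calling `read_gpu_temps()` every few seconds while both cards are under load and logging the result to a file would at least show whether the core temperatures climb before the lockup.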
Thanks for your quick answer.
Re: the GPUs heating each other up: right now I run the second GPU over a PCIe extender cable (I purchased two different ones), so I can’t imagine how they would heat each other, since they are physically apart.
I will try the bios update and update this thread.
Since you’re using PCIe extenders, did you try lowering the PCIe speed?
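Before changing the speed, it may be worth checking what link each card has actually negotiated. Below is a minimal sketch that reads the current and maximum PCIe generation and width from nvidia-smi. The function names are hypothetical, and it assumes nvidia-smi is on the PATH. Note that a GPU may report a lower generation while idle because of power saving, so a reading below the maximum is only a hint, not proof of a bad extender.

```python
import subprocess

# Standard nvidia-smi --query-gpu fields for the PCIe link state.
FIELDS = ("index",
          "pcie.link.gen.current", "pcie.link.gen.max",
          "pcie.link.width.current", "pcie.link.width.max")


def parse_pcie_links(csv_text):
    """Parse nvidia-smi CSV rows (no header, no units) into dicts."""
    links = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip blank lines and rows with non-numeric placeholders.
        if len(parts) != len(FIELDS) or not all(p.isdigit() for p in parts):
            continue
        index, gen, gen_max, width, width_max = map(int, parts)
        links.append({"index": index, "gen": gen, "gen_max": gen_max,
                      "width": width, "width_max": width_max})
    return links


def below_max(link):
    """True if the link runs at a lower generation or width than it could."""
    return link["gen"] < link["gen_max"] or link["width"] < link["width_max"]


def read_pcie_links():
    """Run nvidia-smi once and return the link state of every GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_pcie_links(result.stdout)
```

Running `read_pcie_links()` while the GPUs are under load gives the most meaningful reading. The speed itself is usually lowered in the BIOS/UEFI setup for the slot, not from the driver.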
Happy to report this was a BIOS issue; after upgrading to the latest version, everything works fine. Thank you.