And I can’t run nvidia-bug-report.sh because it hangs. This only happens when I try to run with both GPUs. I tried each one individually on the system and both seem to work fine (I trained a DL model overnight without issues on either). Also, the very same motherboard used to run two RTX 2080s without any problems.
I’m attaching the dmesg output as an alternative to nvidia-bug-report.sh (I also tried it with --safe-mode, but no luck). dmesg (23.4 KB)
Please check for a BIOS update first.
You’re running into an Xid 62, which by itself doesn’t tell much. Since Ampere-generation GPUs are very susceptible to memory overheating, that might be the cause when both GPUs are in use (they heat each other up). Unfortunately, it’s so far not possible to monitor memory temperature on Linux: https://forums.developer.nvidia.com/t/request-gpu-memory-junction-temperature-via-nvidia-smi-or-nvml-api/168346?u=generix
So you can only monitor the GPU temperature using nvidia-smi.
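If it helps, here is a rough sketch of how the GPU temperature of both cards could be read from a script while the training job runs. It only wraps the `nvidia-smi --query-gpu` CSV output; the function names `parse_gpu_temps` and `read_gpu_temps` and the 30-second timeout are my own illustrative choices, not anything from the driver. It reports the GPU temperature only, not the memory junction temperature.

```python
import subprocess

# Fields listed by `nvidia-smi --help-query-gpu`.
QUERY = "index,name,temperature.gpu"


def parse_gpu_temps(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        # Split off the first and last fields so a comma inside the
        # GPU name (unlikely) would not break the parsing.
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        temp = temp.strip()
        readings.append({
            "index": int(index),
            "name": name.strip(),
            # nvidia-smi can print a non-numeric placeholder when a value
            # is unavailable; store None in that case.
            "temp_c": int(temp) if temp.isdigit() else None,
        })
    return readings


def read_gpu_temps(timeout_s=30):
    """Run nvidia-smi once and return one reading per GPU.

    The timeout is an assumption: it keeps the script from blocking
    forever if nvidia-smi itself hangs once the driver is in a bad state.
    """
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    return parse_gpu_temps(result.stdout)
```

Calling `read_gpu_temps()` every few seconds and appending the result to a file should at least show whether the GPU temperature climbs before the Xid shows up in dmesg.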
Re: GPUs heating up: right now I run the second GPU over a PCIe extender cable (I purchased two different ones), so I can’t imagine how they would heat each other, since they are apart.
I will try the BIOS update and report back in this thread.