Unable to determine the device handle for GPU0000:1A:00.0: Unknown Error
I occasionally hit the error above, and I've discovered a strange bug in my dual-GPU system: whether I run on a single card, launch two separate processes (one per GPU), or use DDP for fully parallel training, the data-loading speed over PCIe consistently drops, killing overall throughput.
However, once in a blue moon the “Unknown Error” causes GPU0 to fall off the bus, and if I mask the card with the following command:
sudo nvidia-smi drain -p 0000:1A:00.0 -m 1
and then continue training solely on GPU1, the data-loading speed jumps by roughly 3–6× and the PCIe link status changes from
With the same codebase and dataset, the end-to-end runtime of a full training run is approximately 1 h 10 min in the best case on a single GPU (the performance I'm trying to reproduce), 3 h 20 min on a single GPU normally, and 6 h when each GPU runs the same code independently.
After rebooting, though, I've tried several methods to restore that peak speed, but nothing has worked. I'll attach the full bug report below; any insights would be greatly appreciated!
Thank you for your suggestion.
As I'm currently running some code and can't open the case right now, I used lspci -Dvt to check which slots the GPUs are attached to and got the following two outputs:
GPT informed me that these two cards are mounted on different PCIe x16 root ports. Once my code finishes, I’ll open the case to verify the actual physical connections. Thanks again!
I appreciate it may not be feasible, but given the task is heavily impacted by PCIe bandwidth, I wonder if you've considered upgrading the CPU/motherboard, since the 3060 is capable of PCIe Gen 4?
Thanks for the idea, but swapping out my CPU and motherboard just isn’t realistic—unless they completely die, there’s no real reason to do it right now. It’d cost a fortune and might not even fix things, since I still don’t know what’s causing the issue. If it’s the GPU itself, a new board won’t help. It could easily be something simple like a BIOS setting, slot configuration, or even a dodgy cable/retimer, so I’m really hoping it’s one of those.
Looking at the manual, there's the following note regarding multiple cards:
“When two or more graphics cards are installed, we recommend that you connect the power cable
from the power supply to the VGA_PW connector to ensure system stability.”
VGA_PW is a 6-pin GPU power connector next to the left-hand bank of RAM sockets.
“Falling off the bus” is commonly caused by power-supply instability, and the cards alone total 700 W, so this area is worth addressing.
Also, make sure you're monitoring the PCIe link status while the cards are under load: it's normal for both Gen and Width to be reduced when a card is idle, as a power-saving measure.
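If it helps, here's a minimal sketch of a watcher using the nvidia-ml-py (pynvml) bindings; it assumes those bindings are installed and simply prints each card's current vs. maximum link Gen/Width, plus its power draw, once per second, so you can leave it running while a training job is actually pushing data over the bus:

# pcie_watch.py: print current vs. maximum PCIe link Gen/Width and power draw
# for every GPU once per second (assumes `pip install nvidia-ml-py`).
import time

import pynvml

pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            gen_now = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
            gen_max = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
            width_now = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
            width_max = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # milliwatts -> watts
            print(f"GPU{i}: Gen {gen_now}/{gen_max}  x{width_now}/x{width_max}  {power_w:.0f} W")
        print("---")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()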
Thanks, I saw that too and it’s really helpful, but I’m still not convinced it fully explains the stutters I’m seeing during training. Here are some timing logs from a run:
Time 12.077 (12.077) # the first phase of each epoch is naturally longer
Time 0.176 ( 1.259)
Time 0.179 ( 1.181)
Time 0.176 ( 1.144)
Time 0.176 ( 1.123)
Time 0.179 ( 1.120)
Time 8.361 ( 1.100) # after a few batches in an epoch there’s a sudden stall
That pause actually happens multiple times in each epoch, not just once. Even when I drain the second GPU, I still hit those repeated stalls mid-epoch, so I’m wondering what else might be causing this.
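To narrow it down, my next step is to split the timer so the DataLoader fetch and the GPU step are measured separately, roughly along these lines (a simplified sketch, not my actual training loop; loader/model/criterion/optimizer stand in for my real objects):

# Sketch: time the DataLoader fetch and the GPU step separately, so a stall
# can be attributed to data loading rather than compute or the H2D copy.
import time

import torch

def timed_epoch(loader, model, criterion, optimizer, device):
    model.train()
    end = time.time()
    for images, targets in loader:
        data_time = time.time() - end  # time spent waiting on the DataLoader

        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        torch.cuda.synchronize(device)  # so GPU work shows up in the wall-clock time
        batch_time = time.time() - end
        print(f"data {data_time:.3f}s  batch {batch_time:.3f}s")
        end = time.time()

If the spikes show up in data_time, the loader (disk, decoding, workers) is the bottleneck; if batch_time spikes while data_time stays flat, the stall is on the GPU side or in the PCIe copy itself.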
I also opened up the case and everything looks properly seated—each card is in the right slot and the power cables are all hooked up correctly. I might reach out to after-sales support in a few days, but before that I’ll try updating the BIOS and tweaking my PyTorch DataLoader settings—hopefully that’ll help smooth things out.
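For the DataLoader side, the knobs I plan to experiment with look roughly like this (the values are placeholders I still need to tune, and the stand-in dataset is only there so the snippet runs on its own):

# Placeholder DataLoader settings I intend to experiment with; the exact
# numbers (workers, batch size, prefetch depth) still need tuning on my machine.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the snippet is self-contained; my real pipeline goes here.
train_dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                              torch.randint(0, 10, (1024,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,            # more worker processes to hide disk/decode latency
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies over PCIe
    persistent_workers=True,  # keep workers alive between epochs to avoid respawn stalls
    prefetch_factor=4,        # batches prefetched per worker
    drop_last=True,
)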