Hello,
I’m setting up an eGPU for AI work on my laptop and am seeing an issue where the driver fails to attach to the eGPU device. Any subsequent process that attempts to interact with the device enters the D (uninterruptible sleep) state, so I have to reboot to recover. Shutting down the system hangs as well, though, which forces a full power cycle. The dmesg output from this event sent me here.
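For anyone trying to confirm the same symptom, here is a minimal Python sketch that lists processes currently in the D state by reading /proc/<pid>/stat. It assumes a Linux /proc filesystem; the function names parse_stat and d_state_processes are just illustrative and not part of any NVIDIA tool.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state

def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to report
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-read, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling d_state_processes() after the failed attach should show the stuck processes (for example nvidia-smi itself, if it was run). Reading /proc does not touch the GPU, so it should be safe to run even while the device is wedged.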
After a reboot, nvidia-smi still does not recognize the eGPU (though lspci indicates the driver is attached) until I run something like nvidia-smi -lgc 300,2000. The output of that command includes “Setting locked GPU clocks is not supported for GPU <egpu’s pci ID>”, which implies to me that it doesn’t actually change anything, but I still appreciate that it finds the card.
At this point, nvidia-smi sees the eGPU as expected, and I can use it normally.
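The "driver attached but not listed" state can be checked with a short Python sketch like the one below. It compares the kernel driver bound in sysfs (the same information lspci -k reports) against the bus IDs that nvidia-smi lists. The helper names and the timeout value are my own assumptions, and the example address in the usage note is a placeholder, not the real one from this machine.

```python
import os
import subprocess

def normalize_bus_id(bus_id):
    """Convert a PCI bus ID to sysfs form: lowercase, 4-digit domain."""
    # nvidia-smi prints an 8-digit domain (00000000:01:00.0), sysfs uses 4.
    domain, bus, devfn = bus_id.strip().lower().split(":")
    return f"{domain[-4:].zfill(4)}:{bus}:{devfn}"

def bound_driver(bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to the device, or None."""
    link = os.path.join(sysfs_root, normalize_bus_id(bus_id), "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        return None  # no such device, or no driver bound

def gpus_seen_by_nvidia_smi(timeout=10):
    """Return the set of normalized bus IDs that nvidia-smi reports."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, timeout=timeout, check=True,
    ).stdout
    return {normalize_bus_id(line) for line in out.splitlines() if line.strip()}
```

With a placeholder address such as egpu = "0000:05:00.0", the situation described above would correspond to bound_driver(egpu) returning "nvidia" while normalize_bus_id(egpu) is missing from gpus_seen_by_nvidia_smi(). Note that the timeout cannot rescue a call that has already entered the D state, so this is only meant for the post-reboot case.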
The eGPU is a Tesla P40, and the laptop has an internal GeForce RTX 4050 (not used for CUDA at all). I’m using the nvidia-driver-550-server package for the driver, which recognizes both cards.
Here’s the log taken after the attachment failure: nvidia-bug-report.log.gz (1.2 MB)