Driver attachment failure leading to D state processes, hang on reboot (w/ workaround)

Hello,

I’m setting up an eGPU for AI work on my laptop, and I’m seeing an issue where the driver fails to attach to the eGPU device; any subsequent process that tries to interact with the device enters a D (uninterruptible sleep) state, so I have to reboot to recover. Shutting down the system hangs as well, forcing a full power cycle. The dmesg output from this event sent me here.
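In case it helps anyone diagnose the same failure, this is how I list the stuck processes once the device wedges (generic ps usage, nothing specific to this setup):

ps -eo pid,stat,comm | awk '$2 ~ /^D/'    # processes in uninterruptible sleep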

After a reboot, the eGPU is still not recognized by nvidia-smi (though lspci shows the driver attached) until I run something like nvidia-smi -lgc 300,2000. That’s despite the command’s output including “Setting locked GPU clocks is not supported for GPU <egpu’s pci ID>”, which implies to me that it doesn’t actually change anything; I still appreciate that it finds the card.

At this point, nvidia-smi sees the egpu as expected, and I can use it normally.
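To summarize the workaround as I run it after each reboot (the clock pair 300,2000 is just what I happened to use; I’d assume any valid values have the same effect, and -lgc may need root):

nvidia-smi                     # P40 missing from the device list
sudo nvidia-smi -lgc 300,2000  # reports "not supported", but wakes the card up
nvidia-smi                     # P40 now listed and usable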

The eGPU is a Tesla P40, and the laptop has a GeForce RTX 4050 internally (not used for CUDA at all). I’m using the nvidia-driver-550-server package for the driver, which recognizes both cards.
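Since only the P40 should take CUDA work, I pin jobs to it with CUDA_VISIBLE_DEVICES; the index and script name below are illustrative, check nvidia-smi -L for your own layout:

nvidia-smi -L                             # list GPUs with indices and UUIDs
CUDA_VISIBLE_DEVICES=1 python train.py    # hypothetical job, pinned to the P40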

Here’s the log taken after the attachment failure: nvidia-bug-report.log.gz (1.2 MB)

[  151.351589] NVRM: The NVIDIA GPU 0000:56:00.0
               NVRM: (PCI ID: 10de:1b38) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[  151.351873] nvidia: probe of 0000:56:00.0 failed with error -1
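If you’re checking for a recurrence, the dmesg signature is easy to grep for (assuming the NVRM message stays the same):

dmesg | grep -i "fallen off the bus"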

My guess is overheating: the Tesla needs external cooling because it has no fans of its own; it’s built for GPU servers.
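To test that theory I’ve been watching temperature and power draw under load, roughly like this (standard nvidia-smi query fields, sampled every 5 seconds):

nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw --format=csv -l 5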