One GPU card cannot be found by nvidia-smi

GPU 0: TITAN X (Pascal) (UUID: GPU-ec58a108-5818-441e-1c51-caaaaa5e439a)
GPU 1: TITAN X (Pascal) (UUID: GPU-0abca20e-fcad-3a00-bcc1-7674d0562a2e)
GPU 2: TITAN X (Pascal) (UUID: GPU-067f681c-ba55-ab00-e44a-874b0e5b114c)
Unable to determine the device handle for gpu 0000:84:00.0: GPU is lost. Reboot the system to recover this GPU

“GPU is lost” can occur for any number of reasons that are difficult to diagnose remotely, but most appear to be hardware related. The error message tells you exactly what to do to recover the GPU for now: reboot the system.
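If you want to catch this condition automatically, the listing shown above can be parsed. The following is only a rough sketch: it assumes the output has exactly the shape pasted in this thread, and the helper names (parse_listing, read_listing) are made up for illustration. Whether nvidia-smi prints the "GPU is lost" line to stdout or stderr is not something I have verified, so the sketch reads both.

```python
import re
import subprocess

# Patterns are based only on the output pasted in this thread (assumption).
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-f-]+)\)$")
LOST_LINE = re.compile(r"gpu ([0-9A-Fa-f:.]+): GPU is lost")


def parse_listing(text):
    """Split a GPU listing into healthy GPUs and lost PCI bus IDs."""
    healthy, lost = [], []
    for line in text.splitlines():
        line = line.strip()
        match = GPU_LINE.match(line)
        if match:
            healthy.append((int(match.group(1)), match.group(2), match.group(3)))
            continue
        match = LOST_LINE.search(line)
        if match:
            lost.append(match.group(1))
    return healthy, lost


def read_listing():
    """Run `nvidia-smi -L` and return stdout and stderr combined."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

Calling parse_listing(read_listing()) from a cron job and alerting when the second list is non-empty would tell you when a card drops off the bus, though it does nothing to explain why.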


General checklist for hardware-related issues:

(1) Mechanicals (cards plugged properly into the PCIe slot, secured at bracket)
(2) Electricals (sufficiently sized PSU, clean contacts, cables plugged in properly, no trickery in the cabling)
(3) Thermals (airflow around the cards, monitor GPU temperatures reported by nvidia-smi under load)
(4) Environmentals (extreme altitude, extreme humidity, electromagnetic interference)
(5) Software (latest system BIOS installed, latest CUDA drivers)
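
For item (3), temperatures can be logged while the cards are under load. The sketch below uses the nvidia-smi query interface (--query-gpu with --format=csv,noheader,nounits). The function names and the 85 C limit are placeholders I chose for illustration, not values recommended by NVIDIA; pick a limit appropriate for your card.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_readings(text):
    """Parse CSV rows of index, name, temperature (C), power draw (W)."""
    readings = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank lines and error messages
        index, name, temp, power = (field.strip() for field in row)
        try:
            entry = {"index": int(index), "name": name, "temp_c": int(temp)}
        except ValueError:
            continue  # temperature not reported for this GPU
        try:
            entry["power_w"] = float(power)
        except ValueError:
            entry["power_w"] = None  # power draw not available
        readings.append(entry)
    return readings


def too_hot(readings, limit_c=85):
    """Return readings at or above limit_c (85 is an arbitrary example)."""
    return [r for r in readings if r["temp_c"] >= limit_c]


def sample_once():
    """Query nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_readings(result.stdout)
```

Running sample_once() every few seconds during a long compute job and checking too_hot() on the result should show whether the card that gets lost is also the one running hottest.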

You can physically swap the affected GPU with a neighboring GPU to determine whether problems follow the card or are associated with a particular slot.
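
To keep track of which card sits in which slot across the swap, it helps to record the UUID and PCI bus ID of every GPU before and after, for example with nvidia-smi --query-gpu=uuid,pci.bus_id --format=csv,noheader while all cards are still visible. The following sketch compares two such recordings; the function names are hypothetical, and it assumes the bus ID in the error message differs from the queried one only in the width of the domain prefix.

```python
def normalize_bus_id(bus_id):
    """Reduce a PCI bus ID to a 4-digit domain form, e.g. 0000:84:00.0."""
    domain, rest = bus_id.strip().lower().split(":", 1)
    return domain[-4:].rjust(4, "0") + ":" + rest


def parse_uuid_map(text):
    """Map normalized bus ID -> GPU UUID from 'uuid, pci.bus_id' CSV rows."""
    mapping = {}
    for line in text.splitlines():
        if "," not in line:
            continue
        uuid, bus_id = line.split(",", 1)
        mapping[normalize_bus_id(bus_id)] = uuid.strip()
    return mapping


def diagnose(before, lost_before, after, lost_after):
    """Say whether the failure followed the card or stayed with the slot.

    before/after: bus ID -> UUID maps recorded while all GPUs were visible.
    lost_before/lost_after: bus ID reported as lost before and after the swap.
    """
    lost_before = normalize_bus_id(lost_before)
    lost_after = normalize_bus_id(lost_after)
    suspect = before.get(lost_before)
    if suspect is None:
        return "inconclusive"
    if lost_after == lost_before and after.get(lost_after) != suspect:
        return "slot"  # same slot fails with a different card in it
    if lost_after != lost_before and after.get(lost_after) == suspect:
        return "card"  # the same card fails in its new slot
    return "inconclusive"
```

A single recurrence is weak evidence either way, so treat the answer as a hint and repeat the experiment if you can.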