Hello,
We are using a Titan Xp GPU in our lab for deep learning training with TF Keras.
Recently, we have frequently encountered the following error:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU.
Output of dmesg:
[33472.926944] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
Below is the link to the NVIDIA bug report:
https://www.dropbox.com/s/10nj1ds98565xjd/nvidia-bug-report.log.gz?dl=0
Please advise.
Ali