We are using a Titan Xp GPU in my lab for deep learning training with TF Keras.
Recently, we have frequently been hitting the following error:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU.
Output of dmesg:
[33472.926944] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
Below is the link to the NVIDIA bug report: