I have a Quadro M5000 that is reporting uncorrectable double-bit errors (DBEs).
The server was under heavy load from the irq/285-nvidia process, so nvidia-bug-report.sh only ran partially before it appeared to hang. Looking through the resulting file after a reboot, I found about 30 of the following error messages logged between the beginning of April and today:
Apr 08 23:44:33 p186 kernel: NVRM: RmInitAdapter failed! (0x53:0x65:1949)
Apr 08 23:44:33 p186 kernel: NVRM: rm_init_adapter failed for device bearing minor number 1
Apr 09 00:09:03 p186 kernel: NVRM: GPU at PCI:0000:84:00: GPU-3ef61a56-fffe-aa7c-4fb4-d303ea4a1e3f
Apr 09 00:09:03 p186 kernel: NVRM: Xid (PCI:0000:84:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 2, subpartition 1.
Looking up the last error message (Xid 48), it appears the memory on the GPU is failing. NVIDIA's documentation on this error explains that DBEs are caused by bad memory cells.
The RMA section of that page says to look into an RMA after 10 such errors. Can you verify whether this GPU is failing and needs to be replaced?
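
In case it helps anyone reproduce the count, here is a minimal Python sketch of how the Xid 48 messages could be tallied per GPU from kernel log lines. The function name `count_xid` and the regular expression are my own assumptions based on the log format shown above; they are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed pattern, based on the log lines quoted in this post, e.g.:
#   NVRM: Xid (PCI:0000:84:00): 48, An uncorrectable double bit error (DBE) ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+),")


def count_xid(lines, xid=48):
    """Count how many times the given Xid code appears per PCI device."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid:
            counts[match.group(1)] += 1
    return counts


# Two of the lines quoted above, used as sample input.
sample = [
    "Apr 08 23:44:33 p186 kernel: NVRM: RmInitAdapter failed! (0x53:0x65:1949)",
    "Apr 09 00:09:03 p186 kernel: NVRM: Xid (PCI:0000:84:00): 48, "
    "An uncorrectable double bit error (DBE) has been detected on GPU "
    "in the framebuffer at partition 2, subpartition 1.",
]

for device, total in count_xid(sample).items():
    print(f"{device}: {total} Xid 48 event(s)")
```

To run this against the real log, the `sample` list could be replaced with the lines of a saved kernel log (for example, the output of `journalctl -k` written to a file), and the per-device total compared with the threshold of 10 mentioned in the RMA section.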