An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 5, subpartition 0

Hello.
We use a K80 graphics card and the OpenCL 1.1 library. The server is a ProLiant 380 G9 running Red Hat 6.5. When a colleague kills a thread with kill -9 on the server, the graphics card raises this error and the server stops. The server is then unreachable, and we have to restart it to solve the problem. In the log we found this:

Oct 17 14:22:26 node0 kernel: NVRM: GPU at PCI:0000:86:00: GPU-4214893c-f01f-7bf8-0583-a45f2746bbb4
Oct 17 14:22:26 node0 kernel: NVRM: GPU Board Serial Number: 0325214065534
Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 5, subpartition 0
Oct 17 14:22:26 node0 kernel:
Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 62, 13e9(2468) 00000000 00000000

Does anyone have an idea what happened here?

Best Regards
Markus

Does this happen reliably? You’ll probably have better luck in the CUDA forums since I’m not very familiar with this particular class of problems.
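
To help check how often this occurs, here is a minimal Python sketch that pulls the Xid events out of kernel log lines and counts them per Xid code. The regular expression is modeled only on the two Xid lines quoted above, so other driver versions may format them differently. The names parse_xid_events and count_by_code are hypothetical helpers made up for this illustration.

```python
import re
from collections import Counter

# Pattern modeled on lines such as:
#   NVRM: Xid (PCI:0000:86:00): 48, An uncorrectable double bit error ...
# Assumption: every Xid line follows this layout.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")


def parse_xid_events(lines):
    """Return a list of (pci_address, xid_code, message) tuples."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            pci, code, message = match.groups()
            events.append((pci, int(code), message.strip()))
    return events


def count_by_code(events):
    """Count how many times each Xid code appears."""
    return Counter(code for _, code, _ in events)


# The two Xid lines from the log excerpt in this thread.
sample = [
    "Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 48, "
    "An uncorrectable double bit error (DBE) has been detected on GPU "
    "in the framebuffer at partition 5, subpartition 0",
    "Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 62, "
    "13e9(2468) 00000000 00000000",
]

events = parse_xid_events(sample)
counts = count_by_code(events)
```

To run it over a whole log, pass an open file object instead of the sample list. The log path depends on the syslog configuration, so it is left out here.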

Hello,
I have moved my request to the CUDA forum.

Thanks a lot.
Markus