We use a K80 graphics card and the OpenCL 1.1 library. The server is a ProLiant 380 G9 running Red Hat 6.5. When a colleague kills a thread with kill -9 on the server, the graphics card raises this error and the server stops responding. The server is then unreachable, and we have to restart it to resolve the problem. In the log we found this:
Oct 17 14:22:26 node0 kernel: NVRM: GPU at PCI:0000:86:00: GPU-4214893c-f01f-7bf8-0583-a45f2746bbb4
Oct 17 14:22:26 node0 kernel: NVRM: GPU Board Serial Number: 0325214065534
Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 5, subpartition 0
Oct 17 14:22:26 node0 kernel:
Oct 17 14:22:26 node0 kernel: NVRM: Xid (PCI:0000:86:00): 62, 13e9(2468) 00000000 00000000
Does anyone have an idea what happened here?
I will assume this is a server with vendor-integrated K80, not some jury-rigged home-brew configuration, and that the server is not being operated in some harsh environment (e.g. near a powerful source of EMI or RFI).
The K80 implements ECC with SECDED (single error correct, double error detect). That means it can correct single-bit errors (e.g. a cosmic ray flipping one bit), and each correction is logged. nvidia-smi can show you the current count of such corrections. Double-bit errors cannot be corrected because the check bits do not carry enough information to locate both flipped bits, but they can be detected. So as not to silently continue operation with incorrect data, the GPU is halted. In that respect it works much the same as a server with ECC-protected system memory.
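To make the correct-one, detect-two behaviour concrete, here is a toy sketch using an extended Hamming(8,4) code. This is only an illustration of the SECDED principle; the GPU protects much wider memory words and I am not claiming this is the code the hardware uses. The function names encode and decode are my own.

```python
from functools import reduce
from operator import xor


def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data
    word = [0] * 8  # index 0 = overall parity, indices 1..7 = Hamming(7,4)
    word[3], word[5], word[6], word[7] = d1, d2, d3, d4
    word[1] = word[3] ^ word[5] ^ word[7]  # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]  # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]  # covers positions 4, 5, 6, 7
    word[0] = reduce(xor, word[1:])        # makes the total parity even
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    # XOR of the positions of all set bits: 0 for a valid codeword,
    # otherwise the position of a single flipped bit.
    syndrome = reduce(xor, (i for i in range(1, 8) if word[i]), 0)
    parity_bad = reduce(xor, word) == 1
    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips: assume one. Syndrome 0 means the overall
        # parity bit itself was hit.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, and the
        # code cannot tell which two.
        return "uncorrectable", None
    return status, [word[3], word[5], word[6], word[7]]
```

Flipping any single bit of an encoded word yields "corrected" with the original data; flipping any two yields "uncorrectable", which is the situation in which the GPU halts rather than continuing with bad data.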
I do not see how a double-bit error could be caused by the killing of the thread, but there may be indirect linkage as follows: killing the thread triggered a tear-down of a GPU context connected with that thread, during which the corrupted data was accessed, triggering the double-bit event on the GPU.
I think what you would want to do is look closely at the ECC error statistics for this card: are there many single-bit errors recorded in addition to the one double-bit error event, and do the counts continue to increase under further usage? If so, I would contact the system vendor, as this could indicate a problem with the memory on the GPU, and you may need to replace the GPU. All electronic devices age physically through multiple failure mechanisms, and this can lead to device failure. The aging process is accelerated by operating at elevated temperatures.
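If you want to track the counters over time rather than eyeballing the output of nvidia-smi -q -d ECC, something along these lines may help. It is a rough sketch: the query field names are the ones listed by nvidia-smi --help-query-gpu on the drivers I have seen, so please check that your driver version supports them. The helper names (parse_ecc_csv, read_ecc_counters) are made up for this example.

```python
import subprocess

# Aggregate counters persist across reboots; the volatile ones reset on
# driver reload. Field availability depends on the driver (assumption).
QUERY = ("index,"
         "ecc.errors.corrected.aggregate.total,"
         "ecc.errors.uncorrected.aggregate.total")


def _to_count(field):
    """Convert a CSV field to int, or None when the value is not numeric."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_ecc_csv(text):
    """Parse 'index, corrected, uncorrected' lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        index, corrected, uncorrected = line.split(",")
        rows.append({
            "gpu": index.strip(),
            "corrected": _to_count(corrected),
            "uncorrected": _to_count(uncorrected),
        })
    return rows


def read_ecc_counters():
    """Run nvidia-smi and return the parsed ECC counters for each GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_ecc_csv(out)
```

Calling read_ecc_counters() before and after a representative workload and comparing the two results shows whether the corrected count is still climbing. Note that a K80 board carries two GPUs, so expect two rows per card.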