Hi, my company (KLA-Tencor) wants to know about GPU reliability.
According to the Tesla K10 data sheet, the MTBF is 48076 - 70506.47 hours, but the temperature it lists is 35 Celsius. I assume this is the ambient temperature and not the chip temperature, since 35 Celsius on the chip would be very unrealistic. Is that right? Also, does a failure mean the hardware needs to be replaced, or can it be transient? If it's transient, does that mean an incorrect value is computed, or that the GPU crashes?
Also, not that it matters much (since the values would be short-lived), are the shared memory and caches scrubbed in addition to global memory?
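
For context, here is a rough sketch of how the MTBF range above could be turned into a per-year failure probability. It assumes a constant failure rate (an exponential model) and 24/7 operation, neither of which the data sheet states, and the function name `annual_failure_probability` is only illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming continuous operation


def annual_failure_probability(mtbf_hours, hours=HOURS_PER_YEAR):
    """Probability of at least one failure within `hours`.

    Assumes a constant failure rate of 1/MTBF (exponential model),
    which is an assumption here, not something the data sheet says.
    """
    return 1.0 - math.exp(-hours / mtbf_hours)


# The two ends of the MTBF range quoted from the data sheet.
for mtbf in (48076.0, 70506.47):
    p = annual_failure_probability(mtbf)
    print(f"MTBF {mtbf:.2f} h -> {p:.1%} chance of a failure per GPU-year")
```

Under those assumptions the range works out to roughly a 12% to 17% chance of a failure per GPU per year, which is why it matters whether "failure" means a replacement or a transient error.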