My questions concern A100 GPU memory errors, not the host memory that your link refers to.
We would like to ask NVIDIA engineering what the statistically expected A100 DRAM memory error rate (or degraded-memory statistics) is, and whether it is time-, usage-, or power-dependent.
In addition, we would like to know the criteria or thresholds the device itself uses to trigger row remapping on correctable errors; these are currently unclear to us.
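For reference, the counters we compare below are collected with `nvidia-smi --query-remapped-rows=...` and `nvidia-smi --query-gpu=ecc.errors.corrected.aggregate...` in CSV format (field names as listed by `nvidia-smi --help-query-remapped-rows` and `--help-query-gpu` on a recent driver). A minimal sketch of how we parse that output; the sample CSV text and GPU UUIDs here are made up for illustration:

```python
import csv
import io

# Example query we run per node (requires a recent driver; commented out
# here because it needs actual GPU hardware):
#   nvidia-smi --query-remapped-rows=gpu_uuid,remapped_rows.correctable,\
#       remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure \
#       --format=csv

# Hypothetical sample output, for illustration only (not real device data):
SAMPLE = """gpu_uuid, remapped_rows.correctable, remapped_rows.uncorrectable
GPU-aaaa, 1, 0
GPU-bbbb, 0, 0
"""

def parse_counters(text):
    """Parse nvidia-smi CSV output into {uuid: {field: int}}."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return {row["gpu_uuid"]: {k: int(v) for k, v in row.items() if k != "gpu_uuid"}
            for row in reader}

counters = parse_counters(SAMPLE)
# Flag devices that have remapped at least one row for correctable errors:
flagged = [u for u, c in counters.items()
           if c["remapped_rows.correctable"] > 0]
print(flagged)  # → ['GPU-aaaa']
```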
To be more specific, our customer has ~2000 A100 devices, and we would like to better understand, and get your comments on, the following cases that we observe:
- We have 4 devices with remapped rows due to correctable errors greater than 0, yet their aggregate DRAM ECC error counts are all 0. How can that be?
- We have one device that performed row remapping after a single DRAM correctable error, and two devices with no remapping after 1 and 3 correctable DRAM errors respectively. Why did remapping happen after one correctable error on the first device but not on the other two?
- We have one device with 5 SRAM correctable errors and another with 7094 SRAM correctable errors. Isn't 7094 too high, and might it indicate a problem, given that all the other A100 devices of the same age and usage have 0 SRAM correctable errors (except one device with only 5)?
- We have a device with 1,950,402 DRAM correctable errors, on which 8 row remappings have already occurred, all on the same bank. Its aggregate correctable error count is still increasing, but a remapping failure has not happened yet. Again, what is the criterion for row remapping given so many correctable DRAM errors, and does such a large aggregate correctable error count indicate degraded memory?
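For completeness, this is the kind of cross-check with which we flagged the first case above (remapped rows present while the aggregate DRAM ECC counter is still zero). A minimal sketch; the counter values and UUIDs below are made up for illustration and are not our customer's actual data:

```python
# Hypothetical per-device counters, keyed by GPU UUID (field names follow
# the nvidia-smi query fields; values are invented for this example):
devices = {
    "GPU-aaaa": {"remapped_rows.correctable": 1,
                 "ecc.errors.corrected.aggregate.dram": 0},
    "GPU-bbbb": {"remapped_rows.correctable": 0,
                 "ecc.errors.corrected.aggregate.dram": 3},
}

def inconsistent(counters):
    """Devices that remapped rows for correctable errors even though the
    aggregate DRAM ECC corrected-error counter reads zero."""
    return [uuid for uuid, c in counters.items()
            if c["remapped_rows.correctable"] > 0
            and c["ecc.errors.corrected.aggregate.dram"] == 0]

print(inconsistent(devices))  # → ['GPU-aaaa']
```

We would expect this list to be empty if a correctable-error-driven remap always implies a nonzero aggregate ECC count, which is why the 4 devices in the first case surprised us.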