Besides susceptibility to environmental factors, manufacturing variation may be at play. The enormous concentration of single-bit errors in just ten GPUs reported in the presentation seems suspicious to me, and does not jibe with the “cosmic ray” model of DRAM errors (a DRAM cell struck by an ionizing particle discharges, losing its stored bit). Something else seems to be going on. (*)
I have no idea what statistical distribution bit error rates are supposed to follow under other failure mechanisms. In practical terms, the only people with enough data are likely hyperscalers like Google and AWS, or operators of massive GPU-accelerated supercomputers. And if they do not reveal their data in detail, you are out of luck.
The data you quoted from the presentation would allow you to estimate an upper bound on the error rate, provided that (1) the span of time over which the data was collected is stated and (2) the amount of memory on each SXM is known. Then: max bit error rate = 6,088,374 single-bit errors * 0.98 / 10 / (memory per SXM) / (hours of operation).
Looking at the slides, it seems the SXMs are K20s with 6 GB each, and the observation period was 22 months (2012 through 2014). If I punched in the numbers correctly, this gives a rate of about six single-bit errors / GB / hour for the ten worst GPUs. Excluding those ten, the error rate was only 6,088,374 single-bit errors * 0.02 / 18,678 / 6 GB / 16,104 hours ~= 6.7e-5 single-bit errors / GB / hour ~= 1.6 single-bit errors / TB / day, not far off my guesstimate.
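For anyone who wants to redo the arithmetic, here is the same estimate as a small Python sketch. The constants are read off the slides as described above; 18,678 is the fleet minus the ten worst cards, and 22 months is taken as 22 * 30.5 * 24 = 16,104 hours.

TOTAL_SBES = 6_088_374        # single-bit errors over the whole observation period
HOURS      = 22 * 30.5 * 24   # 22 months ~= 16,104 hours
GB_PER_SXM = 6                # K20 device memory

# Ten worst SXMs, which account for 98% of all SBEs:
worst = TOTAL_SBES * 0.98 / 10 / GB_PER_SXM / HOURS
print(f"worst ten: {worst:.1f} SBEs / GB / hour")            # ~6.2

# Remaining 18,678 GPUs, which account for the other 2%:
rest = TOTAL_SBES * 0.02 / 18_678 / GB_PER_SXM / HOURS
print(f"remainder: {rest:.1e} SBEs / GB / hour")             # ~6.7e-5
print(f"remainder: {rest * 1024 * 24:.1f} SBEs / TB / day")  # ~1.7, i.e. the ~1.6 above within rounding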
What do you need the Kepler error rates for? Personally, I take a very conservative approach: if there is any possibility that an undetected bit error could lead to serious consequences, say an incorrect cancer diagnosis or a collapsing bridge, I want ECC in my hardware, everywhere. And I want the machine to halt on detection of a double-bit error.
(*) [Later:] This is actually covered in the presentation (slide 14; emphasis mine):
“Looking at the 10 SXMs that accounted for 98% of all SBEs, 99% of the errors on those 10 SXMs occurred on L2 cache. This is a clear indication of test escapes (5.35%) for the L2 cache. Removing the 10 worst cards, the distribution of the remainder of the errors is dominated by the device memory (GDDR5).”