Impact of enabling ECC on power and performance


We have more than 500 RTX 8000 GPUs (active mode) in production use for ML workloads. One of the cards was producing erroneous results, which prompted us to look into memory errors. We enabled ECC on this card and readily found a sequence of double-bit errors (DBEs).
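For reference, here is a minimal sketch of how ECC state and uncorrected (double-bit) error counts could be read per GPU by wrapping `nvidia-smi`. The query fields are the ones listed by `nvidia-smi --help-query-gpu`. The helper names (`parse_ecc_csv`, `gpus_needing_attention`, `query_ecc`) are hypothetical, and the script assumes `nvidia-smi` is on the PATH of each host.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "name",
    "ecc.mode.current",
    "ecc.errors.uncorrected.volatile.total",
]


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def gpus_needing_attention(rows):
    """Return GPUs with ECC not enabled, or with a nonzero uncorrected count."""
    flagged = []
    for row in rows:
        count = row["ecc.errors.uncorrected.volatile.total"]
        ecc_off = row["ecc.mode.current"] != "Enabled"
        # The count is not numeric (e.g. "[N/A]") when ECC is disabled.
        has_dbe = count.isdigit() and int(count) > 0
        if ecc_off or has_dbe:
            flagged.append(row)
    return flagged


def query_ecc():
    """Run nvidia-smi on the local host and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ecc_csv(result.stdout)
```

On a host with GPUs, `gpus_needing_attention(query_ecc())` would list the cards that still have ECC off or have logged uncorrected errors since the last driver load. ECC itself is switched on with `nvidia-smi -e 1`, and the new mode takes effect only after the next reboot.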

Since ECC is not enabled by default on these cards, we’re planning to enable ECC on all of them. However, I’m curious to know (as we have a lot of cards):

  • What is the expected impact on power consumption? Is it significant? A rough guideline would help.
  • Is there a performance impact? That is, does ECC add latency to GDDR operations, or is the cost purely in power?


Good question, but I guess this is the wrong place to ask it. A quick search over at the CUDA performance forum yields a couple of threads on this, one of them old.
So it seems ECC has quite an impact on memory bandwidth. The impact on a specific application depends on how memory-bound it is.
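To make the "depends on how memory-bound it is" point concrete, here is a rough back-of-the-envelope sketch. It assumes a simple Amdahl-style model in which only the memory-bound share of runtime stretches when effective bandwidth drops. Both inputs are assumptions you would have to measure yourself (for example with a bandwidth benchmark run with ECC off and on); the function name and the example values are hypothetical, not measured figures for any card.

```python
def estimated_slowdown(memory_bound_fraction, bandwidth_loss):
    """Estimate the runtime multiplier after enabling ECC.

    memory_bound_fraction: share of runtime limited by memory bandwidth (0..1).
    bandwidth_loss: fractional drop in effective bandwidth with ECC on (0..1).
    Both are user-supplied measurements, not constants of the hardware.
    """
    if not 0.0 <= memory_bound_fraction <= 1.0:
        raise ValueError("memory_bound_fraction must be in [0, 1]")
    if not 0.0 <= bandwidth_loss < 1.0:
        raise ValueError("bandwidth_loss must be in [0, 1)")
    # The compute-bound part of the runtime is unaffected.
    compute_part = 1.0 - memory_bound_fraction
    # The memory-bound part scales inversely with remaining bandwidth.
    memory_part = memory_bound_fraction / (1.0 - bandwidth_loss)
    return compute_part + memory_part


# Hypothetical inputs: half the runtime memory-bound, 10% bandwidth lost.
print(f"{estimated_slowdown(0.5, 0.10):.3f}x runtime")
```

A fully compute-bound job (`memory_bound_fraction = 0`) comes out at 1.0x under this model, while a fully memory-bound one slows by the full bandwidth ratio. Real workloads overlap compute and memory traffic, so treat the result as a first-order estimate only.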

Unless you have HBM2 memory. Interesting read.