We have more than 500 RTX8000 GPUs (active mode) in production use for ML workloads. One of the cards was producing incorrect results, which prompted us to look into memory errors. We enabled ECC on this card and quickly found a sequence of double-bit errors (DBEs).
Since ECC is not enabled by default on these cards, we’re planning to enable it on all of them. Because we have so many cards, I’d first like to know:
- What is the expected impact on power consumption? Is it significant? A rough guideline would help.
- Is there a performance impact? That is, does ECC add latency to GDDR operations, or is the cost purely in power?
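
For reference, here is a minimal sketch of how ECC state and uncorrected error counts could be audited across cards before and after the change. It assumes the `nvidia-smi` query fields `ecc.mode.current` and `ecc.errors.uncorrected.volatile.total` are available on the installed driver, and that uncorrected errors are what show up as DBEs on these cards. The helper names (`parse_ecc_rows`, `query_ecc`, `cards_needing_attention`) are hypothetical and only for illustration.

```python
import subprocess

# Fields requested from nvidia-smi (assumed to be supported by the driver).
QUERY_FIELDS = "index,ecc.mode.current,ecc.errors.uncorrected.volatile.total"


def parse_ecc_rows(csv_text):
    """Parse 'csv,noheader' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode, errors = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "ecc_enabled": mode == "Enabled",
            # Counters are reported as "[N/A]" when ECC is off; map that to None.
            "uncorrected": int(errors) if errors.isdigit() else None,
        })
    return rows


def query_ecc():
    """Run nvidia-smi on this host and return the parsed ECC rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_rows(result.stdout)


def cards_needing_attention(rows):
    """Indices of GPUs with ECC disabled or with uncorrected errors logged."""
    return [
        r["index"] for r in rows
        if not r["ecc_enabled"] or (r["uncorrected"] or 0) > 0
    ]
```

On a GPU host, `cards_needing_attention(query_ecc())` would list the cards to look at. As far as I know, ECC itself is switched on with `nvidia-smi -e 1` and only takes effect after a reboot, so the counters would be read after that.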