Hello, I hope your day is going well.
We’ve been stress-testing three new nodes with 4x A100 GPUs and have some questions about power draw and its effect on the cards. GPU utilization stays at 100% (per nvidia-smi monitoring) regardless of the circumstances I describe below, but I’d like to verify that we’re not jeopardizing card performance.
The power draw for each card seems to drop when the card hits 84 °C. I understand that this may be some sort of thermal limit, but I want to make sure I understand the ramifications.
We monitored inlet air temperatures in the datacenter while this was happening and confirmed that the source air was actually cooler when the cards were pegged at 84 °C.
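In case it helps narrow this down, here is a minimal Python sketch of how we could log the throttle reasons next to temperature and power draw for each card. The helper names (`query_nvidia_smi`, `parse_rows`, `thermally_throttled`) are our own and purely illustrative. The query field names are the ones listed by `nvidia-smi --help-query-gpu` on our side and may differ on other driver versions, so please treat this as a rough sketch rather than a definitive check.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; names as listed by `nvidia-smi --help-query-gpu`.
# Assumption: the installed driver still accepts the clocks_throttle_reasons.* names.
FIELDS = [
    "index",
    "temperature.gpu",
    "power.draw",
    "utilization.gpu",
    "clocks.sm",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
]


def query_nvidia_smi():
    """Run nvidia-smi once and return its raw CSV output (one line per GPU)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_rows(text):
    """Turn the CSV text into a list of dicts keyed by field name."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (v.strip() for v in row))) for row in reader if row]


def thermally_throttled(rows):
    """Return the indices of GPUs that report an active thermal slowdown."""
    thermal_fields = FIELDS[5:7]  # the two *_thermal_slowdown columns
    return [
        row["index"]
        for row in rows
        if any(row[field] == "Active" for field in thermal_fields)
    ]
```

Calling `thermally_throttled(parse_rows(query_nvidia_smi()))` in a loop during the stress test should show whether the power drop at 84 °C lines up with a thermal slowdown being reported, as opposed to a power cap.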
Any help is much appreciated!