A100 crashes within 10 minutes due to overheating on Ubuntu 18.04 (without any workload)

System specifications -
Motherboard - ASUSTeK Pro WS WRX80E-SAGE SE WIFI
CPU - AMD Ryzen Threadripper PRO 3995WX 64-Cores
GPU - A100 and GTX 1080
PSU - ST1500-TI (SilverStone ST1500-TI INTRODUCTION)
OS - Ubuntu 18.04

The GPU connection is the same as listed in the A100 product brief,
i.e. a CPU 8-pin to PCIe 8-pin power adapter is used. Both PCIe 8-pin connectors are connected.

Approaches tried -

  1. Installing the driver downloaded via NVIDIA's Advanced Driver Search.
    The drivers installed were of CUDA version 11.2

  2. Installing via Linux’s “Additional drivers” system. The drivers installed were of CUDA version 11.0

  3. Following the NVIDIA Driver Installation Quickstart Guide (NVIDIA Tesla Documentation), along with the post-install actions

All three approaches led to the same outcome -
The drivers would be installed and we would see this when typing nvidia-smi

However, within a mere 10 minutes, the A100 would heat up to 95 degrees and then crash.

This behavior was seen across the board, regardless of the method for driver installation.

Another observation: regardless of the approach, the fan speed for the A100 did not show up in nvidia-smi. However, upon physical inspection, the fans were indeed ramping up in speed.
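
For anyone trying to reproduce this, the temperature climb can be logged with a small script rather than watched by hand. The following is only a minimal sketch, not something from the original setup: it assumes `nvidia-smi` is on the PATH, it uses the `--query-gpu` fields `index`, `name`, `temperature.gpu` and `fan.speed`, and it assumes a field the driver cannot report comes back as a non-numeric value such as `[N/A]`. The function names and the 90-degree threshold are hypothetical choices made for illustration.

```python
import subprocess
import time

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "index,name,temperature.gpu,fan.speed"


def parse_gpu_rows(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split the index off the front and the two numeric fields off the
        # back, so a comma inside the GPU name would not break the parse.
        index, rest = line.split(",", 1)
        name, temp, fan = rest.rsplit(",", 2)
        temp, fan = temp.strip(), fan.strip()
        rows.append({
            "index": int(index.strip()),
            "name": name.strip(),
            # Assumption: unreported values are non-numeric (e.g. "[N/A]").
            "temp_c": int(temp) if temp.isdigit() else None,
            "fan_pct": int(fan) if fan.isdigit() else None,
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_rows(result.stdout)


def hot_gpus(rows, limit_c=90):
    """Return the rows whose temperature is at or above limit_c (hypothetical threshold)."""
    return [r for r in rows if r["temp_c"] is not None and r["temp_c"] >= limit_c]


def log_temperatures(interval_s=5):
    """Print a timestamped reading for every GPU until interrupted."""
    while True:
        for row in read_gpus():
            print(time.strftime("%H:%M:%S"), row)
        time.sleep(interval_s)
```

Calling `log_temperatures()` in a terminal right after boot should give a timeline of how quickly the card heats up while idle, and `hot_gpus(read_gpus())` can be used to flag the moment a card crosses the chosen threshold.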
