A100 crashes within 10 minutes due to overheating on Ubuntu 18.04 (without any workload)

System specifications -
Motherboard - ASUSTeK Pro WS WRX80E-SAGE SE WIFI
CPU - AMD Ryzen Threadripper PRO 3995WX 64-Cores
GPU - A100 and GTX 1080
PSU - ST1500-TI (ST1500-TI)
OS - Ubuntu 18.04

The GPU power connection is the same as listed in the A100 product brief,
i.e. a CPU 8-pin to PCIe 8-pin power adapter is used, and both PCIe 8-pin connectors are connected.

Approaches tried -

  1. Installing the driver downloaded via the official NVIDIA Advanced Driver Search.
    The drivers installed were of CUDA version 11.2

  2. Installing via Linux’s “Additional drivers” system, the drivers installed were of CUDA version 11.0

  3. Following the NVIDIA Driver Installation Quickstart Guide (NVIDIA Tesla Documentation), along with the post-install actions

All three approaches led to the same outcome -
The drivers would be installed and we would see this when typing nvidia-smi

However, within a mere 10 minutes, the A100 would heat up to 95 degrees C and then crash.

This behavior was seen across the board, regardless of the method for driver installation.

Another observation was that, regardless of the approach, no fans for the A100 showed up in nvidia-smi. Upon physical inspection, however, the fans were indeed ramping up in speed.
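One way to capture how quickly the card heats up is to log nvidia-smi readings at a fixed interval. The following is a minimal Python sketch, not a definitive tool. It assumes nvidia-smi is on the PATH and that the installed driver supports the query fields timestamp, temperature.gpu, power.draw and fan.speed. The helper names, the GPU index 0 and the 5-second interval are illustrative assumptions; in a system with two GPUs the A100 may have a different index. A passively cooled board is expected to report its fan speed as [N/A], which the parser maps to None.

```python
import subprocess
import time

# Query fields passed to nvidia-smi (assumed to be supported by the driver).
QUERY_FIELDS = "timestamp,temperature.gpu,power.draw,fan.speed"


def parse_reading(line):
    """Parse one CSV line printed with --format=csv,noheader,nounits."""
    timestamp, temp, power, fan = [field.strip() for field in line.split(",")]

    def to_number(text):
        # Missing sensors are printed as text such as "[N/A]".
        try:
            return float(text)
        except ValueError:
            return None

    return {
        "timestamp": timestamp,
        "temp_c": to_number(temp),
        "power_w": to_number(power),
        "fan_pct": to_number(fan),
    }


def read_gpu(index=0):
    """Run nvidia-smi once for the given GPU index and parse the result."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_reading(out.strip().splitlines()[0])


if __name__ == "__main__":
    # Illustrative polling loop; stop it with Ctrl+C.
    while True:
        print(read_gpu(0))
        time.sleep(5)
```

Logging the readings from a cold boot gives a temperature curve that can be compared before and after any change to the cooling.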

Have you found any resolution? I have a very similar setup (same motherboard and CPU, Ubuntu 20). The A100 came with a passively cooled heat sink. In open air and with forced air, it still climbs to 95 and crashes.

I did find that the 95-degree shutdown threshold is a setting in the BIOS. You can adjust the levels and read the hardware logs there.

I’m at a loss. Under zero load it keeps heating up. Considering liquid cooling at this point.

Nope, our team essentially gave up. No help from NVIDIA either.

I will try to duplicate the issue internally and post an update.

Thank you for looking into this. Are there extra cooling steps that need to be taken for this card? I noticed that mine uses passive/conductive cooling, with no visible large fans. I have plenty of airflow in the case, and have since tried running with all sides of the case open to allow heat dissipation. Under zero load this morning, the card took longer to heat up but still climbed from an idle 33 degrees C to 84 degrees C in 20 minutes. Shutting down now to preserve the device.

A100 cards are built for special servers, not workstations. They don’t have fans but they need to be cooled by extra fans in the server rack.
If you want to run them in workstations, you’ll have to use add-on fans, something like this:

Also check eBay.

generix,

I am learning that you are quite right about the workstation and airflow. I honestly had thought the amount of airflow in the case would be sufficient, but it is not. I have found a bracket similar to the one you posted a link for. I have to invert it (slope it in the other direction) to fit my system design, but I am hopeful this will work.

On a related note, I took a basic Noctua 80mm case fan I had lying around, plugged it in, and pushed it to within an inch of the card, blowing air directly down the cooling fins. The card held 54 degrees C for hours at idle. Once we put it under load, the temperature shot back up. We noted that power draw was only 90W once the card was in the 80s C. I am not sure whether this is thermal throttling (a heavy computation load using all 6912 cores should probably draw more than that). Cooling it back down now to assess whether the same code draws more power at lower temperatures. Regardless, it does show that this is purely an airflow / thermal issue so far.
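Whether the low power draw reflects thermal throttling can be checked by querying the throttle reasons the driver reports. The following Python sketch is one possible approach, not a confirmed diagnostic procedure. It assumes the clocks_throttle_reasons.* field names are accepted by the installed driver (the list printed by `nvidia-smi --help-query-gpu` is authoritative), that each is reported as "Active" or "Not Active", and that temperature and power are reported as numbers. The helper names and GPU index are illustrative.

```python
import subprocess

# Throttle-reason fields; assumed to be supported by the installed driver.
THROTTLE_FIELDS = [
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_slowdown",
]


def parse_throttle(line):
    """Parse 'temp, power, reason1, reason2, ...' from csv,noheader,nounits output."""
    values = [value.strip() for value in line.split(",")]
    temp_c, power_w = float(values[0]), float(values[1])
    # Keep only the reasons the driver reports as currently active.
    active = [name for name, state in zip(THROTTLE_FIELDS, values[2:])
              if state == "Active"]
    return {"temp_c": temp_c, "power_w": power_w, "active_reasons": active}


def read_throttle(index=0):
    """Query temperature, power draw and throttle reasons for one GPU."""
    fields = ",".join(["temperature.gpu", "power.draw"] + THROTTLE_FIELDS)
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_throttle(out.strip().splitlines()[0])


if __name__ == "__main__":
    print(read_throttle(0))
```

Running this while the workload is active and comparing the result at low and high temperatures would show whether a thermal slowdown reason becomes active as the power draw drops.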

Since the A100s are 250/300W boards, they need a lot of air forced through them.
When done right, with nvidia-persistenced running and while idle they should be well under 30°C. Furthermore, since HBM is sensitive to high temperatures, they shouldn’t reach 80°C, otherwise they’re throttling.
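The temperature figures mentioned in this thread can be turned into a simple check for a monitoring script. This is a rough Python sketch: the 30°C idle target and 80°C throttling limit come from the post above, the 95°C shutdown point from the earlier reports, and the function name and category labels are made up for illustration.

```python
# Thresholds taken from this thread, not from an official specification.
IDLE_TARGET_C = 30     # a well-cooled idle card should stay under this
THROTTLE_RISK_C = 80   # at or above this the card is likely throttling
SHUTDOWN_C = 95        # temperature at which crashes were reported


def assess(temp_c, under_load):
    """Classify a temperature reading using the thresholds above."""
    if temp_c >= SHUTDOWN_C:
        return "critical"
    if temp_c >= THROTTLE_RISK_C:
        return "throttle-risk"
    if not under_load and temp_c >= IDLE_TARGET_C:
        return "idle-too-warm"
    return "ok"


if __name__ == "__main__":
    # Readings reported earlier in the thread.
    for temp, load in [(33, False), (54, False), (84, True), (95, False)]:
        print(temp, load, assess(temp, load))
```

By this measure, both the 33°C idle reading and the 54°C reading with the add-on fan would still count as warmer than a properly cooled card at idle.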

Linus Torvalds says: do not buy an NVIDIA card for your computer:

In my notebook, which is very, very expensive and has an NVIDIA card on board, it is not working.