Chassis: Supermicro SuperServer 1027GR-TRF (1U)
OS: Ubuntu 20.04.2 LTS
What I’ve tried so far:
- Testing single Tesla K80 cards: each card tests fine and runs at 50-60C during training when it is the only card installed.
- Testing two Tesla K80 cards: regardless of which slots they are placed in, GPU 0 heats up to around 90C after 10-15 minutes of training.
- Testing three Tesla K80 cards: regardless of which slots they are placed in, GPU 0 heats up to around 90C after 5-10 minutes of training.
All fans appear to be working fine on visual inspection.
Running sudo ipmitool sdr type Temperature | grep GPU lists the cards (GPU1-6) but reports “No Reading” for the temperature on all of them.
Here is the nvidia-smi output with all three cards installed and idle:
Wed May 26 12:17:47 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:04:00.0 Off |                  Off |
| N/A   37C    P8    26W / 149W |      4MiB / 12206MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   30C    P8    29W / 149W |      4MiB / 12206MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:08:00.0 Off |                  Off |
| N/A   40C    P8    25W / 149W |      4MiB / 12206MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:09:00.0 Off |                  Off |
| N/A   34C    P8    30W / 149W |      4MiB / 12206MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:87:00.0 Off |                  Off |
| N/A   27C    P8    26W / 149W |      4MiB / 12206MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:88:00.0 Off |                  Off |
| N/A   32C    P8    28W / 149W |      4MiB / 12206MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
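The snapshot above was taken at idle, so it does not show the climb to around 90C under load. A minimal Python sketch for logging per-GPU temperatures while training runs might look like the following. It assumes nvidia-smi is on the PATH and relies on its --query-gpu and --format=csv options; the function names and the 85C alert threshold are hypothetical choices for illustration, not values from the hardware documentation.

```python
import subprocess

# nvidia-smi query that prints one "index, temperature" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_temps(csv_text):
    """Turn nvidia-smi CSV text into a list of {index, temp_c} dicts."""
    readings = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, temp_c = [field.strip() for field in line.split(",")]
        readings.append({"index": int(index), "temp_c": int(temp_c)})
    return readings


def read_gpu_temps():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_temps(result.stdout)


def over_threshold(readings, limit_c=85):
    """Return the indices of GPUs at or above limit_c (85C is an assumed value)."""
    return [r["index"] for r in readings if r["temp_c"] >= limit_c]
```

Calling read_gpu_temps() every few seconds during a training run and printing the result with a timestamp would give a per-GPU temperature log, which should show whether GPU 0 is the only device that overheats or whether the others follow it more slowly.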
Seems like a BIOS or software issue. Any idea where I’m going wrong?