GPU 0 Overheating if >1 Tesla K80 Installed

Chassis: Supermicro SuperServer 1027GR-TRF (1U)
OS: Ubuntu 20.04.2 LTS

What I’ve tried so far:

  1. Testing single Tesla K80 cards: all test fine and run at 50-60C during training when just one card is installed.
  2. Testing two Tesla K80 cards: regardless of which slots they are in, GPU 0 heats up to around 90C after 10-15 minutes of training.
  3. Testing three Tesla K80 cards: regardless of which slots they are in, GPU 0 heats up to around 90C after 5-10 minutes of training.
  4. Fans are all working fine when visually inspected.
  5. sudo ipmitool sdr type Temperature | grep GPU shows the cards (GPU1-6) but says “No Reading” for the temperature on all.
  6. Here is nvidia-smi at idle:
    Wed May 26 12:17:47 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 00000000:04:00.0 Off |                  Off |
    | N/A   37C    P8    26W / 149W |      4MiB / 12206MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
    | N/A   30C    P8    29W / 149W |      4MiB / 12206MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla K80           Off  | 00000000:08:00.0 Off |                  Off |
    | N/A   40C    P8    25W / 149W |      4MiB / 12206MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla K80           Off  | 00000000:09:00.0 Off |                  Off |
    | N/A   34C    P8    30W / 149W |      4MiB / 12206MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla K80           Off  | 00000000:87:00.0 Off |                  Off |
    | N/A   27C    P8    26W / 149W |      4MiB / 12206MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla K80           Off  | 00000000:88:00.0 Off |                  Off |
    | N/A   32C    P8    28W / 149W |      4MiB / 12206MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
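
In case it helps anyone reproduce the “No Reading” check, here is a rough Python sketch that parses the output of ipmitool sdr type Temperature and lists the GPU sensors that report no value. It assumes the usual five-column, pipe-separated layout of that command. The sensor names and IDs in SAMPLE are made up for illustration and are not taken from this server.

```python
import subprocess


def parse_sdr(text):
    """Parse `ipmitool sdr type Temperature` output into a list of dicts.

    Assumes five pipe-separated columns per line:
    name | sensor id | status | entity | reading
    """
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        name, _sensor_id, status, _entity, reading = fields
        sensors.append({"name": name, "status": status, "reading": reading})
    return sensors


def missing_readings(sensors, prefix="GPU"):
    """Return names of sensors starting with `prefix` that have no reading."""
    return [
        s["name"]
        for s in sensors
        if s["name"].startswith(prefix) and s["reading"].lower() == "no reading"
    ]


def read_sdr():
    """Run ipmitool locally (usually needs root and a loaded IPMI driver)."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample output; names and IDs are illustrative only.
SAMPLE = """\
CPU1 Temp        | 01h | ok  |  3.1 | 45 degrees C
GPU1 Temp        | 0Bh | ns  |  7.1 | No Reading
GPU2 Temp        | 0Ch | ns  |  7.2 | No Reading
"""

if __name__ == "__main__":
    print(missing_readings(parse_sdr(SAMPLE)))  # ['GPU1 Temp', 'GPU2 Temp']
```

On the real machine, missing_readings(parse_sdr(read_sdr())) would run the same check against the live BMC.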

Seems like a BIOS or software issue. Any idea where I’m going wrong?

Here is nvidia-smi during training:
Wed May 26 12:26:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:04:00.0 Off |                  Off |
| N/A   91C    P0    93W / 149W |  11658MiB / 12206MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   67C    P0    89W / 149W |   8125MiB / 12206MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:08:00.0 Off |                  Off |
| N/A   75C    P0    83W / 149W |   8125MiB / 12206MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:09:00.0 Off |                  Off |
| N/A   60C    P0    92W / 149W |   8125MiB / 12206MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:87:00.0 Off |                  Off |
| N/A   39C    P0    90W / 149W |   8125MiB / 12206MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:88:00.0 Off |                  Off |
| N/A   53C    P0   100W / 149W |   8125MiB / 12206MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

It seems that only GPU 0 ever reaches significant temperatures. The others stay in normal ranges.
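
To watch this over time instead of eyeballing the table, something like the following Python sketch could poll nvidia-smi in CSV mode and flag hot GPUs. The --query-gpu fields are standard nvidia-smi ones. The 85C limit is an arbitrary assumption, and SAMPLE is just the temperatures and power draws from the training run above retyped as CSV, so the exact number formatting may differ from real output.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,temperature.gpu,power.draw"


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, temp, power = [f.strip() for f in line.split(",")]
        rows.append(
            {"index": int(index), "temp_c": int(temp), "power_w": float(power)}
        )
    return rows


def hot_gpus(rows, limit_c=85):
    """Return indices of GPUs at or above limit_c (85 is an assumed limit)."""
    return [r["index"] for r in rows if r["temp_c"] >= limit_c]


def query_gpus():
    """Query the live GPUs; requires nvidia-smi on the PATH."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Values from the training run in this post, retyped as CSV.
SAMPLE = """\
0, 91, 93.00
1, 67, 89.00
2, 75, 83.00
3, 60, 92.00
4, 39, 90.00
5, 53, 100.00
"""

if __name__ == "__main__":
    print(hot_gpus(parse_gpu_csv(SAMPLE)))  # [0]
```

Calling hot_gpus(query_gpus()) in a loop with a sleep would log when GPU 0 crosses the limit.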

One additional point: I got ipmitool working, but when I check the GPU temperatures it says “No Reading”.

For now I have set the fan mode to “Full” with Supermicro’s IPMICFG. That keeps all the GPUs cool during training, but I am not looking forward to the electric bill. Is there a way to get IPMI to recognize the GPU temperatures? I think if it could read them, the “Optimal” fan setting would take care of the rest. TIA
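
As a stopgap until the BMC can read the GPU sensors, one possible workaround is to switch the fan mode from the OS based on the hottest GPU. The Python sketch below is only an outline under assumptions: that ipmicfg -fan accepts mode 1 for Full and 2 for Optimal on this board (check with ipmicfg -fan first), that the binary is called ipmicfg, and that 80C and 65C are sensible switch points. None of that comes from Supermicro documentation for this chassis.

```python
import subprocess

# Assumed IPMICFG fan mode codes; verify on your board before relying on them.
FULL = "1"
OPTIMAL = "2"


def choose_mode(current, max_temp_c, high_c=80, low_c=65):
    """Pick a fan mode from the hottest GPU temperature, with hysteresis.

    Switch to Full at or above high_c, back to Optimal at or below low_c,
    and keep the current mode in between so the fans do not flap.
    """
    if max_temp_c >= high_c:
        return FULL
    if max_temp_c <= low_c:
        return OPTIMAL
    return current


def set_mode(mode):
    """Apply a fan mode via IPMICFG (assumed binary name; needs root)."""
    subprocess.run(["ipmicfg", "-fan", mode], check=True)


if __name__ == "__main__":
    mode = OPTIMAL
    # Temperatures a poller might see as a job starts and then finishes.
    for temp in (40, 70, 91, 75, 60):
        mode = choose_mode(mode, temp)
        print(temp, "Full" if mode == FULL else "Optimal")
```

The gap between the two thresholds is deliberate: with a single threshold the fans would toggle every time the temperature hovered around it.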