K80 GPU0 overheats in a compatible server


I’m running a K80 in an HP ML350 Gen9 server. I can assure you there’s plenty of airflow. Power shouldn’t be a problem as it has 4x800W PSUs.

GPU0 idles at 57˚C while GPU1 idles at 36˚C. I can run a TensorFlow training job on GPU1 just fine: as its temperature rises, the server fans ramp up accordingly and the GPU sits at 75˚C for days while training.

If I start the training on GPU0, it heats up quickly, after about 30 training steps, which is roughly 30 seconds after the actual workload hits the GPU. At about 95˚C my server does a thermal reboot before the fans can ramp up.

I’m guessing that the temperature reported for GPU0 is simply wrong. Why would it be about 20 degrees higher than GPU1 when neither of them is doing anything?
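
For anyone who wants to compare the two readings over time, here is a minimal sketch of how the per-GPU temperatures could be logged once a second. It assumes nvidia-smi is on the PATH and supports the --query-gpu fields used below; the function names (parse_smi_csv, read_gpus, log_forever) are made up for illustration and are not part of any NVIDIA tool.

```python
import csv
import io
import subprocess
import time

# Fields requested from nvidia-smi; all four are standard --query-gpu properties.
QUERY = "index,temperature.gpu,power.draw,utilization.gpu"


def parse_smi_csv(text):
    """Parse nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        index, temp, power, util = (f.strip() for f in fields)
        rows.append({
            "index": int(index),
            "temp_c": int(temp),
            # Power and utilization are kept as strings because some boards
            # report "[N/A]" or "[Not Supported]" for them.
            "power_w": power,
            "util_pct": util,
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)


def log_forever(interval_s=1.0):
    """Print a timestamped line per GPU until interrupted with Ctrl+C."""
    while True:
        stamp = time.strftime("%H:%M:%S")
        for gpu in read_gpus():
            print(f"{stamp} GPU{gpu['index']}: {gpu['temp_c']} C, "
                  f"{gpu['power_w']} W, {gpu['util_pct']} % util")
        time.sleep(interval_s)
```

Calling log_forever() in a second terminal while the training starts would show how fast GPU0 climbs compared with GPU1, and whether its power draw and utilization at idle differ as well. That would not prove the sensor is right or wrong, but it gives numbers to compare.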

All of this is with the latest supported driver, 470.239.06.