Tesla Temperature Monitoring

I’d like to read out the current temperature of a Tesla C1060 card.

Could anybody please give me a hint on where to find code snippets in C, or a script?

thanks from the maintenance man.

NVAPI is what you want:
http://developer.nvidia.com/object/nvapi.html
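
If you’re on Linux, where NVAPI isn’t available, NVML (which ships with the driver) is another option, assuming it supports your card and driver version. A minimal C sketch that reads the temperature of device 0; build with something like gcc read_temp.c -lnvidia-ml:

/* Minimal sketch: read the current temperature of GPU 0 via NVML.
   Assumes the NVML header and library are available on your system. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int temp;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
        printf("GPU 0 temperature: %u C\n", temp);

    nvmlShutdown();
    return 0;
}

From a script, nvidia-smi -q -d TEMPERATURE reports the same sensor.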

We are running an app on a K80 that does just fine for the first 2-3 minutes, but the temperature of one of the GPUs rises steadily to 90C after 3 minutes, and the clock speeds then throttle to between a third and an eighth of what they were. There is only a passive heat sink. Has anyone else overcome this hurdle?
TIA.

$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 340.32     Driver Version: 340.32         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   91C    P0   110W / 149W |    940MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   63C    P0   120W / 149W |    940MiB / 11519MiB |     64%      Default |
+-------------------------------+----------------------+----------------------+

The K80 is designed to be installed in an OEM-qualified server that has been designed and certified by the OEM for the K80. It sounds like you have plugged it into some other platform. In that case, this is exactly what you should expect: the K80 card itself does not provide adequate cooling on its own.

A proper K80 OEM server monitors this temperature and varies airflow across the passive heatsink accordingly, to manage cooling.
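
If it helps to confirm what is going on, a rough NVML sketch along these lines can watch the temperature, SM clock, and hardware-slowdown flag as the throttling kicks in. Device index 0 and the 2-second interval are arbitrary choices, and it assumes a driver recent enough to report thresholds and throttle reasons; build with something like gcc watch_throttle.c -lnvidia-ml:

/* Rough sketch: poll GPU temperature, SM clock, and the HW-slowdown throttle
   flag to watch thermal throttling as it happens. */
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int temp, sm_clk, slowdown = 0;
    unsigned long long reasons;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

    /* Temperature at which the GPU starts reducing clocks (if reported). */
    nvmlDeviceGetTemperatureThreshold(dev, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN, &slowdown);
    printf("slowdown threshold: %u C\n", slowdown);

    for (;;) {
        if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS &&
            nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clk) == NVML_SUCCESS &&
            nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons) == NVML_SUCCESS)
            printf("temp %u C, SM clock %u MHz%s\n", temp, sm_clk,
                   (reasons & nvmlClocksThrottleReasonHwSlowdown) ? " [HW slowdown]" : "");
        sleep(2);
    }

    nvmlShutdown();  /* not reached in this sketch; shown for completeness */
    return 0;
}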


Hi!

I see this is an old thread but it’s still an issue.
I’m running a K80 in an HP ML350 Gen9 server. I can assure you there’s plenty of airflow. Power shouldn’t be a problem as it has 4x800W PSUs.

GPU0 idles at 57 °C while GPU1 idles at 36 °C. I can run a TensorFlow training on GPU1 just fine; as its temperature rises, the server fans ramp up accordingly, and the GPU sits at 75 °C for days while training.

If I start the training on GPU0, it just heats up after about 30 training steps, which is roughly 30 seconds after the actual workload hits the GPU. At about 95 °C my server thermally reboots before the fans can ramp up.

I’m guessing that the temperature reported on GPU0 is just false. Why would it be 20 degrees higher than GPU1 when neither of them is doing anything?

All of this is running on the latest supported driver: 470.239.06

Your GPU0 is behaving as if the cooling is not happening correctly. I’m not sure why you would discount the temperature readings when you state yourself that your server thermally reboots.

I can’t explain what is going on, but for some reason although the server appears to be properly cooling GPU1 (based on your description) it does not appear to be properly cooling GPU0.

Doing a Google search on this topic, I see a couple of remarks that could be relevant:

  1. The HPE server might only recognize GPUs that are HPE-specific, i.e. have HPE identifying information in the VBIOS.
  2. The ML350 in particular seems to want its Tesla GPUs in PCIE slot 6. There may be issues using other slots.

I can’t independently confirm either of these, but if they are correct, then it may also help to explain your observations.

If I shut the process down before the GPU reaches the shutdown temperature, it starts to cool back down. And while it’s still at a high temperature, the fans start to ramp up. So it’s as if it’s heating up faster than the fan ramp-up can kick in, or reporting the temperature too slowly.
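
To quantify that, something like this rough NVML sketch could log timestamped temperatures for both GPUs once a second, so the ramp rate of GPU0 can be compared against GPU1 and against when the fans react (the interval and CSV-style output are arbitrary; build with something like gcc log_temps.c -lnvidia-ml):

/* Rough sketch: log timestamped temperatures for all GPUs once per second. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, i, temp;
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetCount(&count) != NVML_SUCCESS) return 1;

    for (;;) {
        for (i = 0; i < count; i++) {
            if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
                nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
                printf("%ld,GPU%u,%u\n", (long)time(NULL), i, temp);
        }
        fflush(stdout);
        sleep(1);
    }

    nvmlShutdown();  /* not reached in this sketch; shown for completeness */
    return 0;
}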

I’ve seen several forum threads describing a similar situation with different systems. The thing all of them have in common is GPU0 being hotter all the time and heating up under load.
I even tried adding another P400 to be the one with index 0, as proposed here.

Another thing to mention is that I’m running it with a Pop!_OS guest on Proxmox. But I also tried with a Debian Bookworm guest. Same problem.

I also tried using the card to render games on a Windows guest in WDDM mode, with the P400 to plug the monitor into. That also worked fine with GPU1, but the same problem occurred with GPU0.

Really odd.

The GPU is recognized by the BIOS as a K80, so that’s not the issue. I’ll try slot 6.

Another thing to check, apart from the obvious (a missing air baffle or fans), is that any unused PCIe slots have blanking plates fitted on the rear panel, so airflow is not bypassed.

Check for accumulated dust both in the K80 and in the enclosure as a whole. Blow the dust off all electronic components with a can of compressed air. It is also possible that a temperature sensor has become unplugged or defective, although this is rare in my experience.

I do not know the specifics of cooling in this server. Often there is a bank of fans, and it is possible for one of the fans to become defective while the others still work, providing an impression of proper airflow. Many systems will have pre-boot hardware diagnostics in the BIOS that include a test of the system’s fans.

I assume you tried exchanging GPU0 and GPU1 to test whether it is the GPU or the slot location?

Airflow is fine, blanks are in place, the air baffle is there, and all fans are working. iLO would scream anyway if any of them were defective.

The GPU is clean, and the temp sensor can’t be unplugged, since it does eventually ramp the fans, just too late.

As I wrote in my other reply, everything is in place regarding cooling, and if any fan were missing or defective, iLO would shout all day. I once removed a fan from the running server, and it caused all the fans to immediately run at 100% until I put the missing fan back.

What do you mean by exchanging? The two GPUs are on the same card.

Now it’s in slot 6, following Robert’s advice, and it’s all the same, except GPU1 now runs at 85-88 °C and the fans don’t ramp any further. But GPU0 is sitting there at 80 °C doing nothing.

I have only experienced this kind of delayed fan response once and it almost caused my CPU to overheat and shut down the system. CPU temperature was already in the 90+ degree Celsius range. Best I could tell this happened because the thermal sensor on the motherboard was coated with a thick layer of dust, which apparently acted like an insulating blanket, so the increase in the case temperature wasn’t noted right away (normally case fans respond within seconds). After I thoroughly cleaned out five years of accumulated dust, the system ran flawlessly. I now clean it once a year.

If the heat sink fins of the GPU are clean, and there is unobstructed airflow directed across the GPU, the overheating of the GPU must be due to insufficient airflow produced by the bank of fans in the enclosure. That’s why I recommended cleaning the inside of the case, to make sure the sensor(s) are able to measure temperature properly. As I said, it is also possible for temperature sensors to become defective, although that is rare because they are simple devices. Sometimes the sensors are plugged into a 2-pin header on the motherboard rather than soldered on, and can become unplugged when someone installs HW in the system and accidentally rips out the sensor’s cable.

It’s also possible for I2C controllers typically involved in thermal sensing and fan control to become defective. Likewise, it is possible for the K80 to have developed hardware defects after nearly 10 years of usage (don’t know if the system’s fan control takes data from the GPU’s temperature sensor into account; it seems plausible that it does).

If you acquired this system from a proper system integrator, that vendor should help you resolve the issue. If this is a system you cobbled together on your own, I have provided all the knowledge I have about these situations; that is all I can do to help.


Sorry, then I misunderstood. You were talking about the two GPUs comprising the K80, not about two K80 cards in different slots.

I would try to get hold of a second K80 and swap it in, to test whether it is the graphics card or the HW/SW of the system.

If you enter hpasmcli -s "show fan; show temp", do both temperature sensors have the same fan threshold?

Unfortunately I don’t have a contract with HPE, so I don’t have hpasmcli installed.

If you haven’t done it already, you may wish to update your server firmware to the most recent version available from HPE for your particular ML350 (Gen 9) variant.

Just did it, but as I don’t have a contract I could only download the latest critical release, which is still a couple of releases newer than what was installed before. It didn’t help, though.