Tesla Temperature Monitoring

I’d like to read out the current temperature of a Tesla C1060 card.

Could anybody please give me a hint on where to find code snippets in C, or a script?

thanks from the maintenance man.

NVAPI is what you want:
http://developer.nvidia.com/object/nvapi.html
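
If you’re on Linux, where NVAPI isn’t available, NVML (which ships with the driver) is another option, assuming it supports your card and driver version. A minimal C sketch that reads the temperature of device 0; build with something like gcc read_temp.c -lnvidia-ml:

/* Minimal sketch: read the current temperature of GPU 0 via NVML.
   Assumes the NVML header and library are available on your system. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int temp;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
        printf("GPU 0 temperature: %u C\n", temp);

    nvmlShutdown();
    return 0;
}

From a script, nvidia-smi -q -d TEMPERATURE reports the same sensor.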

We are running an app on a K80 that does just fine for the first 2-3 minutes, but the temperature of one of the GPUs rises steadily to 90C after 3 minutes, and the clock speeds then throttle to between a third and an eighth of what they were. There is only a passive heat sink. Has anyone else overcome this hurdle?
TIA.

$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 340.32     Driver Version: 340.32         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   91C    P0   110W / 149W |    940MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   63C    P0   120W / 149W |    940MiB / 11519MiB |     64%      Default |
+-------------------------------+----------------------+----------------------+

The K80 is designed to be installed in an OEM-qualified server that has been designed and certified by the OEM for the K80. It sounds like you have plugged it into some other platform. In that case, this is exactly what you should expect: the K80 card itself does not provide adequate cooling on its own.

A proper K80 OEM server monitors this temperature and varies airflow across the passive heatsink accordingly, to manage cooling.
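
If it helps to confirm what is going on, a rough NVML sketch along these lines can watch the temperature, SM clock, and hardware-slowdown flag as the throttling kicks in. Device index 0 and the 2-second interval are arbitrary choices, and it assumes a driver recent enough to report thresholds and throttle reasons; build with something like gcc watch_throttle.c -lnvidia-ml:

/* Rough sketch: poll GPU temperature, SM clock, and the HW-slowdown throttle
   flag to watch thermal throttling as it happens. */
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int temp, sm_clk, slowdown = 0;
    unsigned long long reasons;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

    /* Temperature at which the GPU starts reducing clocks (if reported). */
    nvmlDeviceGetTemperatureThreshold(dev, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN, &slowdown);
    printf("slowdown threshold: %u C\n", slowdown);

    for (;;) {
        if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS &&
            nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clk) == NVML_SUCCESS &&
            nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons) == NVML_SUCCESS)
            printf("temp %u C, SM clock %u MHz%s\n", temp, sm_clk,
                   (reasons & nvmlClocksThrottleReasonHwSlowdown) ? " [HW slowdown]" : "");
        sleep(2);
    }

    nvmlShutdown();  /* not reached in this sketch; shown for completeness */
    return 0;
}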


Hi!

I see this is an old thread but it’s still an issue.
I’m running a K80 in an HP ML350 Gen9 server. I can assure you there’s plenty of airflow. Power shouldn’t be a problem as it has 4x800W PSUs.

GPU0 idles at 57 °C while GPU1 idles at 36 °C. I can run a TensorFlow training on GPU1 just fine; as its temperature rises, the server fans ramp up accordingly, and the GPU sits at 75 °C for days while training.

If I start the training on GPU0, it just heats up after about 30 training steps, which is roughly 30 seconds after the actual workload hits the GPU. At about 95 °C my server thermally reboots before the fans can ramp up.

I’m guessing that the temperature reported on GPU0 is just false. Why would it be 20 degrees higher than GPU1 when neither of them is doing anything?

All of this is running on the latest supported driver: 470.239.06

Your GPU0 is behaving as if the cooling is not happening correctly. I’m not sure why you would discount the temperature readings when you state yourself that your server thermally reboots.

I can’t explain what is going on, but for some reason although the server appears to be properly cooling GPU1 (based on your description) it does not appear to be properly cooling GPU0.

Doing a Google search on this topic, I see a couple of remarks that could be relevant:

  1. The HPE server might only recognize GPUs that are HPE-specific, i.e. have HPE identifying information in the VBIOS.
  2. The ML350 in particular seems to want its Tesla GPUs in PCIE slot 6. There may be issues using other slots.

I can’t independently confirm either of these, but if they are correct, then it may also help to explain your observations.

If I shut the process down before the GPU reaches the shutdown temperature, it starts to cool back down. And while it’s still at a high temperature, the fans start to ramp up. So it’s as if it’s heating up faster than the fan ramp-up can kick in, or reporting the temperature too slowly.
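
To quantify that, something like this rough NVML sketch could log timestamped temperatures for both GPUs once a second, so the ramp rate of GPU0 can be compared against GPU1 and against when the fans react (the interval and CSV-style output are arbitrary; build with something like gcc log_temps.c -lnvidia-ml):

/* Rough sketch: log timestamped temperatures for all GPUs once per second. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, i, temp;
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetCount(&count) != NVML_SUCCESS) return 1;

    for (;;) {
        for (i = 0; i < count; i++) {
            if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
                nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
                printf("%ld,GPU%u,%u\n", (long)time(NULL), i, temp);
        }
        fflush(stdout);
        sleep(1);
    }

    nvmlShutdown();  /* not reached in this sketch; shown for completeness */
    return 0;
}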

I’ve seen several forum threads describing a similar situation with different systems. The thing all of them have in common is GPU0 being hotter all the time and heating up under load.
I even tried adding another P400 to be the one with index 0, as proposed here.

Another thing to mention is that I’m running it with a Pop!_OS guest on Proxmox. But I also tried with a Debian Bookworm guest. Same problem.

I also tried using the card to render games on a Windows guest in WDDM mode, with the P400 to plug the monitor into. That also worked fine with GPU1, but the same problem occurred with GPU0.

Really odd.

The GPU is recognized by the BIOS as a K80, so that’s not the issue. I’ll try slot 6.

Another thing to check, apart from the obvious (a missing air baffle or fans), is that any unused PCIe slots have blanking plates fitted on the rear panel, so airflow is not bypassed.

Check for accumulated dust both in the K80 and in the enclosure as a whole. Blow the dust off all electronic components with a can of compressed air. It is also possible that a temperature sensor has become unplugged or defective, although this is rare in my experience.

I do not know the specifics of cooling in this server. Often there is a bank of fans, and it is possible for one of the fans to become defective while the others still work, providing an impression of proper airflow. Many systems will have pre-boot hardware diagnostics in the BIOS that include a test of the system’s fans.

I assume you tried exchanging GPU0 and GPU1 to test whether it is the GPU or the slot location?

Airflow is fine, blanks are in place, the air baffle is there, and all fans are working. iLO would scream anyway if any of them were defective.

The GPU is clean, and the temp sensor can’t be unplugged, since it does eventually ramp the fans, just too late.

As I wrote in my other reply, everything is in place regarding cooling, and if any fan were missing or defective, iLO would shout all day. I once removed a fan from the running server, and it caused all the fans to immediately run at 100% until I put the missing fan back.

What do you mean by exchanging? The two GPUs are on the same card.

Now it’s in slot 6, following Robert’s advice, and it’s all the same, except GPU1 now runs at 85-88 °C and the fans don’t ramp any further. But GPU0 is sitting there at 80 °C doing nothing.

I have only experienced this kind of delayed fan response once and it almost caused my CPU to overheat and shut down the system. CPU temperature was already in the 90+ degree Celsius range. Best I could tell this happened because the thermal sensor on the motherboard was coated with a thick layer of dust, which apparently acted like an insulating blanket, so the increase in the case temperature wasn’t noted right away (normally case fans respond within seconds). After I thoroughly cleaned out five years of accumulated dust, the system ran flawlessly. I now clean it once a year.

If the heat sink fins of the GPU are clean, and there is unobstructed airflow directed across the GPU, the overheating of the GPU must be due to insufficient airflow produced by the bank of fans in the enclosure. That’s why I recommended cleaning the inside of the case, to make sure the sensor(s) are able to measure temperature properly. As I said, it is also possible for temperature sensors to become defective, although that is rare because they are simple devices. Sometimes the sensors are plugged into a 2-pin header on the motherboard rather than soldered on, and can become unplugged when someone installs HW in the system and accidentally rips out the sensor’s cable.

It’s also possible for I2C controllers typically involved in thermal sensing and fan control to become defective. Likewise, it is possible for the K80 to have developed hardware defects after nearly 10 years of usage (don’t know if the system’s fan control takes data from the GPU’s temperature sensor into account; it seems plausible that it does).

If you acquired this system from a proper system integrator, that vendor should help you resolve the issue. If this is a system you cobbled together on your own, I have provided all the knowledge I have about these situations; that is all I can do to help.


Sorry, then I misunderstood. You were talking about the two GPUs comprising the K80, not about two K80 cards in different slots.

I would try to get hold of a second K80 and swap it in, to test whether it is the graphics card or the HW/SW of the system.

If you enter hpasmcli -s "show fan; show temp", do both temperature sensors have the same fan threshold?

Unfortunately I don’t have a contract with HPE, so I don’t have hpasmcli installed.

If you haven’t done it already, you may wish to update your server firmware to the most recent version available from HPE for your particular ML350 (Gen 9) variant.

Just did it, but as I don’t have a contract I could only download the latest critical release, which is still a couple of releases newer than what was installed before. It didn’t help, though.