I just installed our newly donated C1060 (Thanks Nvidia!) and I’m trying to read the temperatures. However, nvidia-smi returns nothing useful:
[root@governator ~]# nvidia-smi
Gpus found in probe:
Found Gpuid 0x1000
Attaching all probed Gpus…OK
Getting unit information…OK
Getting all static information…
So it looks like it found something. But the log contains:
[root@governator ~]# cat nvidia-smi.log
Timestamp : Thu Jan 8 13:32:48 2009
which is not very informative. The card runs kernels fine, so everything seems to be installed correctly. This is with drivers 177.82 installed through “nvidia-installer --update” on a Fedora 8 x86_64 box.
Ah, ok. I thought all the cards had similar monitoring. It would be nice to know, because the card is damn hot; I practically burn my fingers touching the back side. It’s possible the box needs more ventilation, since there’s a lot of other stuff heating it, but it’s hard to know without numbers.
I’m now using “nvclock -T” to read out the C1060 temperature and aggregate it in a plot using rrdtool. This works, but I don’t understand the behavior of the temperature. See this plot:
Notice how the GPU temp sits at a higher level for a long stretch. The thing is that no kernel was executing during that time. There were only two kernel runs that day, and they show up as the spikes at about 10:30 and 17:00. Yet the card suddenly went up in temp at 11:00 and stayed high.
Any idea what could be causing this? Can a kernel “hang” on the card and just loop endlessly? Does the driver do something to the card when it idles, like secretly compute seti@home or something? ;) Or, which seems much more likely, are there issues with the temperature readout?
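For anyone wanting to reproduce the logging setup mentioned above, here is a minimal sketch of polling “nvclock -T” and feeding the value to rrdtool. It is not the poster's actual script. The exact text nvclock prints is an assumption here (a line containing something like "temperature: 74C"), so adjust the pattern to whatever your version outputs. The helper names and the file name gpu_temp.rrd are hypothetical, and the RRD file has to be created beforehand with “rrdtool create” and a GAUGE data source.

```python
import re
import subprocess

# ASSUMPTION: the tool prints a line such as "GPU temperature: 74C".
# Check the real output of "nvclock -T" on your box and adapt this pattern.
TEMP_PATTERN = re.compile(r"temperature\D*?(\d+)\s*C", re.IGNORECASE)


def parse_temperature(text):
    """Return the first temperature found in text as an int, or None."""
    match = TEMP_PATTERN.search(text)
    return int(match.group(1)) if match else None


def rrd_update_command(rrd_path, temp_c):
    """Build the argument list for 'rrdtool update', stamped with the current time (N)."""
    return ["rrdtool", "update", rrd_path, f"N:{temp_c}"]


def log_once(rrd_path):
    """Read the temperature once and store it; returns the value or None."""
    result = subprocess.run(["nvclock", "-T"], capture_output=True, text=True)
    temp = parse_temperature(result.stdout)
    if temp is not None:
        subprocess.run(rrd_update_command(rrd_path, temp), check=True)
    return temp

# Hypothetical usage, e.g. from a cron job running once a minute:
#     log_once("gpu_temp.rrd")
```

If the output contains more than one temperature line, this picks the first one, so it is worth checking which sensor that is before trusting the plot.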
Well, look at both kernel runs: the GPU went up to ~82 C on each. The idle temp just went up from 70 C to ~76 C. Have you listened to the fans running on the GPU? My headless system always runs its fans at 100% on a cold boot. Running a single CUDA program (even deviceQuery) ramps the fans down to a low idle.
This could account for the higher idle temp you are seeing if your system has the same behavior. But this is only a guess.