nvidia-smi on C1060 returns no data

Hi,
I just installed our newly donated C1060 (Thanks Nvidia!) and I’m trying to read the temperatures. However, nvidia-smi returns nothing useful:

[root@governator ~]# nvidia-smi
Gpus found in probe:
Found Gpuid 0x1000
Attaching all probed Gpus…OK
Getting unit information…OK
Getting all static information…

So it looks like it found something. But the log contains:

[root@governator ~]# cat nvidia-smi.log

==============NVSMI LOG==============

Timestamp : Thu Jan 8 13:32:48 2009

which is not very informative. The card runs kernels fine, so everything seems to be installed correctly. This is with driver 177.82, installed through “nvidia-installer --update” on a Fedora 8 x86_64 box.

Any ideas what to try?

Thanks,

/Patrik

As far as I know, nvidia-smi only supports the S870 and S1070. I’ve yet to figure out a good alternative for reading temperatures on the C1060 under Linux, but I guess this will push me into doing so…

Ah, ok. I thought all the cards had similar monitoring. It would be nice to know, because the card is damn hot; I practically burn my fingers touching the back side. It’s possible the box needs more ventilation, since it’s got a lot of other stuff heating it, but it’s hard to know without numbers.

If you are running X, you could use nvidia-settings.

The command line option (assuming the Tesla is gpu 1) is:
nvidia-settings -q [gpu:1]/gpucoretemp
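
If you just want the bare number (e.g. for scripting), the terse flag should strip the label, though I haven’t checked how far back that option goes:

nvidia-settings -t -q [gpu:1]/gpucoretemp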

Hmm. I’m not running X, but I could start it if that would help. But how does it work running X with a Tesla? (The monitor is hooked up to an old MX4400 PCI card.)

So, the new nvclock (0.8b4) is able to read the temps on my C1060, but only after I’ve started X; just setting up the device IDs isn’t enough, it seems. Weird.

Aha, that actually works for me, even without the X server. It prints a warning about the NV-CONTROL extension missing on display 10.0, but spits out numbers anyway:

Xlib: extension “NV-CONTROL” missing on display “localhost:10.0”.

nvidia Tesla C1060

=> GPU temperature: 83C

=> Board temperature: 67C

Now, what’s Tmax? I’ve never seen that published.

Thanks,

/Patrik

A few years ago, there were reports of bit errors appearing in other cards (not Tesla) at the 95-100C level.

I’d be curious to know if NVIDIA has tested the temperature limits of these cards…

Hi again,

I’m now using “nvclock -T” to read out the C1060 temperature and graph it over time with rrdtool. This works, but I don’t understand the temperature behavior. See this plot:

[attached plot: GPU temperature over the course of the day]

Notice how the GPU temp sits at a higher level for a long stretch. The thing is, no kernel was executing during those times. There were only two kernel runs that day, and they show up as the spikes at about 10:30 and 17:00. Yet the card suddenly went up in temperature around 11:00 and stayed high.

Any idea what could be causing this? Can a kernel “hang” on the card and just loop endlessly? Does the driver do something to the card when it idles, like secretly compute seti@home or something? ;) Or, which seems much more likely, are there issues with the temperature readout?
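
In case it helps anyone else doing the same: the collection is just a small script run from cron every minute, roughly like this (the rrd file name, step and limits here are placeholders, not necessarily what I actually use):

# one-time setup: a 1-minute GAUGE for the GPU temperature, kept for a day
rrdtool create /var/lib/rrd/c1060.rrd --step 60 \
    DS:gputemp:GAUGE:120:0:150 \
    RRA:AVERAGE:0.5:1:1440

# cron job: pull the “GPU temperature” line out of nvclock -T and feed it to the rrd
TEMP=$(nvclock -T | awk '/GPU temperature/ { sub("C", "", $NF); print $NF }')
rrdtool update /var/lib/rrd/c1060.rrd N:$TEMP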

Thanks,

/Patrik

Well, look at both kernel runs: the GPU went up to ~82 C on each. The idle temp just went up from 70 C to ~76 C. Have you listened to the fans on the GPU? My headless system always runs its fans at 100% on a cold boot. Running a single CUDA program (even deviceQuery) ramps the fans down to a low idle.

This could account for the higher idle temp you are seeing if your system has the same behavior. But this is only a guess.
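
One way to check would be to log the temperature for a while, run a single CUDA program once, and see whether the idle reading shifts afterwards. A crude loop is enough (the deviceQuery path depends on where your SDK build put it):

while true; do date; nvclock -T | grep 'GPU temperature'; sleep 60; done
# in another shell, run any CUDA program once, e.g.:
# ~/NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery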

Ah, good point. I don’t know about the fan speeds. Nvclock doesn’t give any fan info, so I don’t know how to read them out, and the machine is in a room that’s so loud I can’t hear the fan on the board.