nvidia-smi on C1060 returns no data

Hi,
I just installed our newly donated C1060 (Thanks Nvidia!) and I’m trying to read the temperatures. However, nvidia-smi returns nothing useful:

[root@governator ~]# nvidia-smi
Gpus found in probe:
Found Gpuid 0x1000
Attaching all probed Gpus…OK
Getting unit information…OK
Getting all static information…

So it looks like it found something. But the log contains:

[root@governator ~]# cat nvidia-smi.log

==============NVSMI LOG==============

Timestamp : Thu Jan 8 13:32:48 2009

which is not very informative. The card runs kernels fine, so everything seems to be installed correctly. This is with driver 177.82, installed through “nvidia-installer --update” on a Fedora 8 x86_64 box.

Any ideas what to try?

Thanks,

/Patrik

As far as I know, nvidia-smi only supports the S870 and S1070. I’ve yet to figure out a good alternative for reading temperatures on the C1060 under Linux, but I guess this will push me into doing so…

Ah, ok. I thought all the cards had similar monitoring. It would be nice to know, because the card is damn hot; I practically burn my fingers touching the back side. It’s possible the box needs more ventilation, since it’s got a lot of other stuff heating it, but it’s hard to know without numbers.

If you are running X, you could use nvidia-settings.

The command line option (assuming the Tesla is gpu 1) is:
nvidia-settings -q [gpu:1]/gpucoretemp
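
If you just want the bare number (e.g. for scripting), the terse flag should strip the label, though I haven’t checked how far back that option goes:

nvidia-settings -t -q [gpu:1]/gpucoretemp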

Hmm. I’m not running X, but I could start it if that would help. But how does it work running X with a Tesla? (The monitor is hooked up to an old MX4400 PCI card.)

So, the new nvclock (0.8b4) is able to read the temps on my C1060, but only after I’ve started X; just setting up the device IDs isn’t enough, it seems. Weird.

Aha, that actually works for me, even without the X server. It prints a warning about the NV-CONTROL extension missing on display 10.0, but spits out numbers anyway:

Xlib: extension “NV-CONTROL” missing on display “localhost:10.0”.

nvidia Tesla C1060

=> GPU temperature: 83C

=> Board temperature: 67C

Now, what’s Tmax? I’ve never seen that published.

Thanks,

/Patrik

A few years ago, there were reports of bit errors appearing in other cards (not Tesla) at the 95-100C level.

I’d be curious to know if NVIDIA has tested the temperature limits of these cards…

Hi again,

I’m now using “nvclock -T” to read out the C1060 temperature and graph it over time with rrdtool. This works, but I don’t understand the temperature behavior. See this plot:

[attached plot: GPU temperature over the course of the day]

Notice how the GPU temp sits at a higher level for a long stretch. The thing is, no kernel was executing during those times. There were only two kernel runs that day, and they show up as the spikes at about 10:30 and 17:00. Yet the card suddenly went up in temperature around 11:00 and stayed high.

Any idea what could be causing this? Can a kernel “hang” on the card and just loop endlessly? Does the driver do something to the card when it idles, like secretly compute seti@home or something? ;) Or, which seems much more likely, are there issues with the temperature readout?
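
In case it helps anyone else doing the same: the collection is just a small script run from cron every minute, roughly like this (the rrd file name, step and limits here are placeholders, not necessarily what I actually use):

# one-time setup: a 1-minute GAUGE for the GPU temperature, kept for a day
rrdtool create /var/lib/rrd/c1060.rrd --step 60 \
    DS:gputemp:GAUGE:120:0:150 \
    RRA:AVERAGE:0.5:1:1440

# cron job: pull the “GPU temperature” line out of nvclock -T and feed it to the rrd
TEMP=$(nvclock -T | awk '/GPU temperature/ { sub("C", "", $NF); print $NF }')
rrdtool update /var/lib/rrd/c1060.rrd N:$TEMP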

Thanks,

/Patrik

Well, look at both kernel runs: the GPU went up to ~82 C on each. The idle temp just went up from 70 C to ~76 C. Have you listened to the fans on the GPU? My headless system always runs its fans at 100% on a cold boot. Running a single CUDA program (even deviceQuery) ramps the fans down to a low idle.

This could account for the higher idle temp you are seeing if your system has the same behavior. But this is only a guess.
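
One way to check would be to log the temperature for a while, run a single CUDA program once, and see whether the idle reading shifts afterwards. A crude loop is enough (the deviceQuery path depends on where your SDK build put it):

while true; do date; nvclock -T | grep 'GPU temperature'; sleep 60; done
# in another shell, run any CUDA program once, e.g.:
# ~/NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery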

Ah, good point. I don’t know about the fan speeds. Nvclock doesn’t give any fan info, so I don’t know how to read them out, and the machine is in a room that’s so loud I can’t hear the fan on the board.