Tesla D870 GPU core temp: is there a way to read it?

I have a desktop machine with an 8800 GTX, and a Tesla D870 attached to a Linux workstation. On the desktop, I can read the GPU core temp using gkrellm or by running nvidia-settings -t -q GPUCoreTemp. When I try to run this command on the Tesla system (no X windows installed), I get an error that there is no active X display. Even if X were running, I’m not sure nvidia-settings would read the Tesla temps. Does anyone know?

Is there any way to read these core temps on the D870? I would prefer not to install X windows, but I will if I must.

I need to know the temps because I’m currently running stability tests on my application. On the desktop, my application runs for only ~2 to 5 hours before the system freezes for a few seconds and a kernel returns a “launch timeout” error. For the record, all my kernels complete within 20 ms, so this is not caused by a kernel taking a little extra time: it seems to be a kernel “crashing” on the card, getting stuck in an infinite loop, and thus failing to kick the watchdog. While the simulation is running, the GPU core temp goes up to 79 C.

On the Tesla system, I’ve run the simulation for 14+ hours without any problems. I know the Tesla D870 has better cooling than my desktop, but I need to know the core temp to make a quantitative comparison and support the hypothesis that the crashes on my desktop are due to overheating.

There is thermal monitoring on the S870. I will ask if the same software (nvidia-smi) will work on the D870. The software is included in the 171.06 driver.

That’s neat. I had no idea such a tool existed. It seems to work fine on the D870, reporting reasonable values for the intake, exhaust, and GPU core temps, which is all I’m interested in.

It complains about the PSU: probably one of the things only supported on the S870. And I have no idea if the fan information is accurate.

Edit: it can even toggle the LED on the front from amber to green :)
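
For long stability runs it’s handy to log the temps over time instead of eyeballing them. Here’s a minimal sketch in Python that shells out to nvidia-smi and scrapes any “label: N C” lines from its output. Note that nvidia-smi’s text layout and sensor labels vary between driver versions, so the parsing here is an assumption you’d adjust to match what your nvidia-smi actually prints:

```python
import re
import subprocess
import time

# Match any "<label>: <N> C" line in nvidia-smi's plain-text output.
# NOTE: the exact labels and layout vary by driver version, so check
# what your nvidia-smi actually prints and tweak this regex if needed.
TEMP_RE = re.compile(r"(?P<label>[^:\n]+?)\s*:\s*(?P<temp>\d+)\s*C\b")

def parse_temps(smi_output):
    """Return {sensor label: temperature in C} scraped from nvidia-smi text."""
    return {m.group("label").strip(): int(m.group("temp"))
            for m in TEMP_RE.finditer(smi_output)}

def log_temps(interval_s=60):
    """Poll nvidia-smi once a minute and print a timestamped temp line."""
    while True:
        out = subprocess.run(["nvidia-smi"], capture_output=True,
                             text=True).stdout
        print(time.strftime("%Y-%m-%d %H:%M:%S"), parse_temps(out))
        time.sleep(interval_s)
```

Run it in a second terminal while the stability test is going; grepping the log afterward gives you the peak temps to compare between the desktop and the Tesla box.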

Josh,

If it’s overheating, you should provide a copy of HOOMD to these benchmarking sites so they can use your code to stress test GPUs for heat/power :)

John

Heh, that’s a good idea. I suppose that with only slight overheating in games, just a few pixels flicker, and you need to get the GPU really hot to crash the game. In my testing, HOOMD seems very sensitive to overheating/overclocking.

And nvidia-smi reports a significantly cooler 73 C when I run on the Tesla box. So that at least supports my overheating hypothesis.

Though, I guess to really know for certain whether the random problems I’m seeing on the desktop are from overheating, I’m going to need to crack the case and point a fan at it to see if everything runs smoothly or not.

We’ve got our boxes set up in a proper machine room with air chillers etc., and we’ve never had any heat issues, even with 3 GPUs in one box running flat out. If you’re having thermal issues, moving the box into a machine room could save you a lot of hassle. I’d definitely do that once you start running production simulations on your CUDA boxes.

Cheers,

John

Unfortunately, I can’t move my everyday desktop office machine to the server room :)

I did test it with the case open and the GPU ran 3 degrees cooler at 76 C with no crashing problems for 14 hours. The overheating problem is there because the 8800 GTX is in an old Dell workstation case with essentially no case airflow (it was the only machine we had at the time with a big enough power supply for the card).

The D870 machine is in the cooled server room now, though, and is running another few degrees cooler. In the test I mentioned above, it was sitting next to the desktop, which made for a very cramped desk: one desktop, one workstation, one laptop, the Tesla, an old CRT monitor, an LCD display, two keyboards, and two mice, all on one ~6 foot by ~4 foot desk.

Yeah, some of the off-the-shelf commercial machines have next to no airflow. For the 3-GPU test boxes we built, we used an Antec gamer-oriented case, which has worked out very well. I installed 3 fans in each of them, and neither one has had any trouble so far. I had one of the CUDA test boxes running in a weakly-air-conditioned conference room for a few weeks last summer without trouble (probably 75-80 degrees F ambient with the sun coming in the windows), so with enough fans you can make almost anything work. You might see if there’s a spot where you could rig up an extra fan or two on your Dell box. If you wanted to get serious, you could always do a case mod with some power tools… :-)

Cheers,
John

Just as a thought: I have an 8600 and 2x Tesla cards in a Cooler Master Cosmos case. My cards routinely operate in the 70s. In the dead of Columbus, OH winter with all the windows in my apt open, I was able to get the cards to report temps down in the 50s… though the ambient temp required me to wear a jacket and long johns… :)