I have a desktop machine with an 8800 GTX, and a Tesla D870 attached to a Linux workstation. On the desktop, I can read the GPU core temp using gkrellm or by running nvidia-settings -t -q GPUCoreTemp. When I try to run this command on the Tesla system (no X windows installed), I get an error that there is no active X display. Even if X were running, I’m not sure whether nvidia-settings would read the Tesla temps. Does anyone know?
Is there any way to read these core temps on the D870? I would prefer not to install X windows, but I will if I must.
I need to know the temps because I’m currently running some stability testing on my application. On the desktop, my application only runs for ~2 to 5 hours before the system freezes for a few seconds and a kernel returns a “launch timeout” error. For the record, all my kernels complete within 20 ms, so this is not caused by a kernel taking a little extra time. It seems to be a kernel “crashing” on the card, ending up in an infinite loop, and so failing to kick the watchdog. While the simulation is running, the GPU core temp goes up to 79 C.
On the Tesla system, I’ve run the simulation for 14+ hours without any problems. I know the Tesla D870 has better cooling than my desktop, but I need to know the core temp to make a quantitative comparison and support the hypothesis that the crashes on my desktop are due to overheating.
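For the desktop half of that comparison, here is a minimal sketch of how the temperature could be logged over a run. It only wraps the nvidia-settings -t -q GPUCoreTemp command mentioned above, so it still needs an active X display and does nothing for the D870. The function names, the 10-second interval, and the assumption that the terse output is a bare integer (one line per GPU, first line used) are illustrative, not something I have confirmed for every driver version.

```python
import subprocess
import sys
import time


def parse_core_temp(output):
    """Return the first temperature (degrees C) found in terse output.

    Assumption: `nvidia-settings -t` prints a bare integer per GPU,
    one per line; only the first non-empty line is used.
    """
    for line in output.splitlines():
        line = line.strip()
        if line:
            return int(line)
    raise ValueError("no temperature found in output")


def read_core_temp():
    """Query the core temperature through nvidia-settings (needs X)."""
    result = subprocess.run(
        ["nvidia-settings", "-t", "-q", "GPUCoreTemp"],
        capture_output=True, text=True, check=True,
    )
    return parse_core_temp(result.stdout)


def log_temps(out, samples, interval_s=10.0, read=read_core_temp,
              sleep=time.sleep):
    """Write `samples` lines of "timestamp<TAB>temp" to `out`.

    Each line is flushed immediately so the last readings survive
    if the machine freezes shortly afterwards.
    """
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        out.write(f"{stamp}\t{read()}\n")
        out.flush()
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical usage: one hour of readings at 10-second intervals.
    log_temps(sys.stdout, samples=360)
```

Redirecting the output to a file while the simulation runs would give a temperature trace to line up against the time of the launch timeout.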