Temperature monitoring: is sub-degree accuracy possible?

Hi,

I am undertaking a study of how GPU temperature affects power consumption for a variety of workloads. As is well known, GPU leakage current drops as the temperature is reduced. For this we are using a liquid-cooled GTX 580, and we have constructed a power-monitoring device for each of the PCIe power inputs to the GPU with sub-1% accuracy. However, the main source of error at the moment is the temperature monitoring, for which I am currently relying on nvidia-smi to log the temperature.

Questions:

[list=1]

[*]nvidia-smi only reports the temperature to the nearest degree. Is the temperature sensor built into the card capable of more accuracy, and if so, is there an easy way to access and log it (Linux or Windows is fine)? See the sketch after this list for the kind of polling I have in mind.

[*]If the built-in monitoring is insufficient for 0.1-degree accuracy, can anyone recommend an alternative method?

[/list]
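For reference, here is a minimal sketch of the kind of polling loop I have in mind, using the NVML C API (assuming the NVML header and library are usable with this card; as far as I can tell the API only returns whole degrees, so by itself it has the same 1 C granularity as nvidia-smi):

[code]
/* Minimal sketch: poll the on-die temperature sensor via NVML.
 * Build with something like: gcc -o gputemp gputemp.c -lnvidia-ml
 * Note: nvmlDeviceGetTemperature() fills an unsigned int, i.e. whole
 * degrees C, so this does not by itself give sub-degree resolution. */
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    int i;
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    nvmlDevice_t dev;
    rc = nvmlDeviceGetHandleByIndex(0, &dev);       /* first GPU */
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "no device handle: %s\n", nvmlErrorString(rc));
        nvmlShutdown();
        return 1;
    }

    for (i = 0; i < 60; ++i) {                      /* one sample per second */
        unsigned int tempC;
        rc = nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
        if (rc == NVML_SUCCESS)
            printf("%d,%u\n", i, tempC);            /* CSV: seconds, degrees C */
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}
[/code]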

As a taste of the power-efficiency improvement that is possible, I attach some result plots. The first shows the reduction in power draw for a constant workload running over a 32-82 C temperature range. The second shows Gflops/W as a function of temperature. Note that when air cooled, this kernel typically runs in the 80s C, so the reduction achieved through liquid cooling is significant.

The kernel we are using here is the one reported in [1107.4264] Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units, which for the parameters chosen in this study sustains in excess of one Tflops, so it is quite brutal with regard to power draw.

Thanks.
efficiency_temp_cold_room.pdf (4.52 KB)
power_temp_cold_room.pdf (4.15 KB)

How much energy are you spending on cooling the GPU to that extent? Let's say you spend 20 J/s on cooling the GPU while the power consumption goes down by 30 J/s; then you're gaining 10 J/s. This matters for determining whether you are getting a net gain or a net loss, right?

Really interesting work, anyway! This type of work hasn't been done enough, in my opinion :)

We are spending about 15 watts on the cooling solution, but seem to have a 30 watt reduction in GPU power draw, so we do have a net reduction in power.

While this is a good point, we are mainly interested in understanding the underlying issues before thinking about how to build the most power-efficient complete system. For example, we spend 10 watts on the pump in the present solution, but this pump has the capacity to drive much more than a single GPU, so its power cost can be amortized across multiple GPUs in a cluster.

Yes it’s an interesting topic with many angles.

You've probably already discarded this, but:
Could you perhaps measure the liquid inlet/outlet temperatures and use the specific heat capacity of the liquid to get the heat being dissipated? delta_E = mass * delta_T * C_p
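As a back-of-the-envelope example of what that would look like (the flow rate and inlet/outlet temperatures below are made-up placeholders, and C_p is taken for water):

[code]
/* Rough sketch of the coolant calorimetry suggested above.
 * All measurement values here are hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    const double cp_water  = 4186.0; /* J/(kg*K), specific heat of water */
    const double mass_flow = 0.010;  /* kg/s  - hypothetical pump flow rate */
    const double t_in      = 30.0;   /* deg C - hypothetical inlet reading  */
    const double t_out     = 33.5;   /* deg C - hypothetical outlet reading */

    /* Power carried away by the coolant: P = mdot * C_p * delta_T */
    double power = mass_flow * cp_water * (t_out - t_in);

    printf("Heat removed by coolant: %.1f W\n", power); /* ~146.5 W here */
    return 0;
}
[/code]

The catch in practice is measuring the mass flow rate and the fairly small inlet/outlet temperature difference accurately enough.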

Before nvidia-smi, I recall playing around with lm_sensors to read the temperature sensors on older cards. I would take a look at the bus scanning in the lm_sensors package and see if the GPU exposes an I2C bus to the operating system. That might give you more direct access.

(Sorry I can’t be more specific, as this was several years ago…)

If I recall correctly, lm_sensors looks for the presence of nvidia-settings and uses that to get the GPU temperature. I don't remember the precision of the nvidia-settings temperature readout, but the problem I had with it was that it refused to run (even in command-line-only mode) unless X was running.