Cooling and fan speed on two Titan X (Pascal) GPUs

I have two Titan X (Pascal) cards used for machine learning, probabilistic fiber tracking, and other processes that run for days or weeks at a time. Every time I check nvidia-smi on that system, I notice that GPU1 is much hotter than GPU0.

Tue Nov  8 11:01:15 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
| 39%   66C    P2    62W / 250W |   9112MiB / 12189MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:04:00.0      On |                  N/A |
| 59%   85C    P2   115W / 250W |   9249MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Given the equal utilization, I don’t really understand why this would be. We have another system running GTX 980s that, when running the same program, behaves as follows:

Tue Nov  8 10:55:31 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:02:00.0     Off |                  N/A |
|  0%   31C    P0    39W / 180W |      0MiB /  4037MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980     Off  | 0000:03:00.0     Off |                  N/A |
| 33%   56C    P2    66W / 180W |   1861MiB /  4037MiB |     42%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980     Off  | 0000:83:00.0     Off |                  N/A |
| 33%   56C    P2    76W / 180W |   1866MiB /  4037MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 980     Off  | 0000:84:00.0     Off |                  N/A |
| 32%   54C    P2    72W / 180W |   1881MiB /  4037MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+

Note the similar temps across the currently used GPUs. Is there something I should be concerned about with respect to the Titan X above running that much hotter than the other card?

Thank you,

Keith

I don’t think this is necessarily something to worry about. Quick experiment: Physically swap the two GPUs involved. Does the “hot spot” follow the GPU, or does it stay with the slot position?

[follows the GPU]
hypothesis (1): different VBIOS versions with different fan profiles
hypothesis (2): different power dissipation due to
-- (2a) different vendors, or different vendor SKUs
-- (2b) normal manufacturing tolerances in the components, including the GPU itself

[stays with the slot]
hypothesis (1): unequally distributed workload (note “GPU-Util” in the log you showed; see the query sketch after this list) <-----
hypothesis (2): insufficient or turbulent airflow around the hotter card
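
One quick way to probe the workload-distribution hypothesis (a sketch only; field names taken from nvidia-smi --help-query-compute-apps, assuming your driver version supports that query) is to list which processes are actually resident on each card:

nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name --format=csv

If one bus ID consistently shows more (or heavier) processes than the other, the temperature difference simply follows the load.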

[Later:] Oops. I notice belatedly that the GPU with the lower utilization (0%) shows very high power usage (115W). That makes no sense. The sensor output shown by nvidia-smi is not instantaneous, so maybe this is an artifact of the sensor query process, where the 115W power reading refers to a slightly earlier time frame than the GPU utilization, which has since fallen to 0% because that part of the workload finished? Try monitoring continuously to see whether the numbers make more sense.
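
For the continuous monitoring, a simple loop query along these lines (one sample per second; the field names come from nvidia-smi --help-query-gpu, so adjust if your driver version differs) makes it easy to correlate fan speed, temperature, power, and utilization over time for both cards:

nvidia-smi --query-gpu=timestamp,index,fan.speed,temperature.gpu,power.draw,utilization.gpu --format=csv -l 1

Logging that to a file while the long-running job executes should show whether the 115W reading coincides with a burst of activity on GPU 1 or is just a stale sample.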