We have a new GPU server running Ubuntu 14.04 (x86_64) with a current kernel (4.4.0-59-generic). The system is a new build in a rack-mounted chassis with industrial-grade cooling (Supermicro SYS-4028GR-TR). It contains four Titan X (Pascal) cards with driver version 375.26. No displays are attached.
The problem is that the fan speeds stay at mid-range levels (~40-60% according to nvidia-smi) under heavy load while the GPU temperatures sit at the thermal limit (82-85C), leading to decreased performance. The chassis stays at about 40C. Running a dummy X server and setting the target fan speed to any given value (e.g. "nvidia-settings -a [gpu:0]/GPUFanControlState=1 -a [fan:0]/GPUTargetFanSpeed=100") does change the fan setting, but the performance state is then automatically set to P8 (in adaptive mode it is usually P2 under load), leading to very low performance. Adding -a [gpu:0]/GPUPowerMizerMode=2 to the fan control settings appears to keep the performance state at P2 for a few seconds, after which it falls back to P8. Persistence mode is switched on.
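For anyone who wants to reproduce the observation, here is a minimal sketch of how the temperature, fan speed and performance state can be logged together. It assumes Python 3.7+ and that nvidia-smi accepts the query fields index, temperature.gpu, fan.speed, pstate and utilization.gpu; the function names and dictionary keys are our own and purely illustrative.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = ["index", "temperature.gpu", "fan.speed", "pstate", "utilization.gpu"]


def _to_int(text):
    """Return int(text), or None for values such as '[Not Supported]'."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_status(csv_text):
    """Parse 'csv,noheader,nounits' output of nvidia-smi into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text)):
        if not rec:
            continue  # skip blank lines
        index, temp, fan, pstate, util = (field.strip() for field in rec)
        rows.append({
            "index": _to_int(index),
            "temp_c": _to_int(temp),
            "fan_pct": _to_int(fan),
            "pstate": pstate,
            "util_pct": _to_int(util),
        })
    return rows


def read_gpu_status():
    """Run nvidia-smi once and return the parsed rows (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_status(out)


def gpus_in_pstate(rows, pstate):
    """Return the indices of all GPUs currently reporting the given P-state."""
    return [row["index"] for row in rows if row["pstate"] == pstate]
```

Calling read_gpu_status() every few seconds while the load is running, and checking gpus_in_pstate(rows, "P8"), should show the moment a card falls from P2 to P8 after the fan settings are applied.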
The settings are applied on the headless setup with a script from:
https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness#TOC-Faking-a-Head-for-a-Headless-X-Server . A few small changes were needed for the Titan X (Pascal).
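To make the per-card settings concrete, below is a rough sketch of the calls involved. It is not the linked script; it only rebuilds the nvidia-settings command quoted above for each card. The assumptions are that fan N belongs to GPU N (one fan per card), that the dummy X server listens on display ":0", and that there are four cards. The function names are made up for illustration, and the GPUPowerMizerMode value 2 is simply the one we tried.

```python
import os
import subprocess


def fan_command(gpu_index, fan_percent, powermizer_mode=None):
    """Build the nvidia-settings argument list for one card.

    Assumes [fan:N] is the fan of [gpu:N], which may not hold on every system.
    """
    if not 0 <= fan_percent <= 100:
        raise ValueError("fan_percent must be between 0 and 100")
    cmd = [
        "nvidia-settings",
        "-a", f"[gpu:{gpu_index}]/GPUFanControlState=1",
        "-a", f"[fan:{gpu_index}]/GPUTargetFanSpeed={fan_percent}",
    ]
    if powermizer_mode is not None:
        cmd += ["-a", f"[gpu:{gpu_index}]/GPUPowerMizerMode={powermizer_mode}"]
    return cmd


def apply_to_all(num_gpus=4, fan_percent=100, powermizer_mode=2, display=":0"):
    """Apply the settings to every card through the dummy X server.

    The display name is an assumption; use whatever the headless X setup exports.
    """
    env = dict(os.environ, DISPLAY=display)
    for gpu_index in range(num_gpus):
        subprocess.run(fan_command(gpu_index, fan_percent, powermizer_mode),
                       env=env, check=True)
```

For example, fan_command(0, 100) yields exactly the command line quoted earlier for the first card, and apply_to_all(fan_percent=95) would try 95% on all four cards.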
The performance drop from thermal limiting is about 20% within a few minutes of starting the load. With the fan speed control enabled, the drop is about 75%. Otherwise everything appears to be working fine.
Is there a working way to force the fan speeds to 100% (or 95%) on a headless server so that temperatures stay lower while performance stays up? Or could this be a driver bug, or are we doing something wrong?