Fan speed control with 375.26, Ubuntu 14.04, headless

Hi,

We have a new GPU server running Ubuntu 14.04 (x86_64) with the current kernel (4.4.0-59-generic). The setup is a new build in a rack-mounted chassis with industrial-grade cooling (Supermicro SYS-4028GR-TR). It contains four Titan X (Pascal) cards with driver 375.26. No displays are attached.

The problem is that the fan speeds stay at mid-range levels (~40-60% according to nvidia-smi) under heavy load while the GPU temperatures sit at the thermal limit (82-85C), which reduces performance. The chassis stays at about 40C. Running a dummy X server and setting the target fan speed to any given value does change the fan setting (e.g. nvidia-settings -a [gpu:0]/GPUFanControlState=1 -a [fan:0]/GPUTargetFanSpeed=100), but the performance state is then automatically set to P8 (in adaptive mode it is usually P2 under load), leading to very low performance. Adding -a [gpu:0]/GPUPowerMizerMode=2 to the fan control settings appears to keep the performance up at P2 for a few seconds, after which it falls back to P8. Persistence mode is switched on.
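
For reference, the fan command above can be generated for every card in one go. This is only a minimal sketch: the function names fan_cmd and set_all_fans are made up for illustration, and it assumes that fan index N belongs to GPU index N (true only for single-fan cards, check with nvidia-settings -q fans). It prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch: build the nvidia-settings call that pins one GPU's fan to a fixed
# speed. Assumption: fan index == GPU index (single-fan cards only).

fan_cmd() {
    local gpu="$1"    # GPU index as shown by nvidia-smi
    local speed="$2"  # target fan speed in percent (0-100)
    # The attribute strings are quoted so the shell does not treat [..] as a glob.
    printf "nvidia-settings -a '[gpu:%s]/GPUFanControlState=1' -a '[fan:%s]/GPUTargetFanSpeed=%s'\n" \
        "$gpu" "$gpu" "$speed"
}

# Print one command per GPU for indices 0 .. count-1.
set_all_fans() {
    local count="$1"
    local speed="$2"
    local i=0
    while [ "$i" -lt "$count" ]; do
        fan_cmd "$i" "$speed"
        i=$((i + 1))
    done
}

# Four cards, 95% fan speed: prints four commands.
set_all_fans 4 95
```

To actually apply the settings, the printed lines can be piped to sh once an X server with Coolbits enabled is running and DISPLAY points at it.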

The settings are applied on the headless setup with a script from:
https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness#TOC-Faking-a-Head-for-a-Headless-X-Server . A few small changes were needed for the Titan X (Pascal).

The performance drop from thermal limiting is about 20% within a few minutes of starting the load. With the fan speed control enabled the drop is about 75%. Otherwise everything appears to be working fine.
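
To line the performance drop up with temperature, fan speed and P-state, the values can be sampled with nvidia-smi while the load runs. A rough sketch follows; the helper name hot_gpus and the 82C threshold are assumptions for illustration (the threshold is taken from the temperatures reported above), while the query fields are standard nvidia-smi ones.

```shell
#!/usr/bin/env bash
# Sketch: flag GPUs that are at or above a temperature limit.

QUERY="index,temperature.gpu,fan.speed,pstate"

# Reads "index, temp, fan, pstate" CSV rows on stdin (the format produced by
# --format=csv,noheader,nounits) and prints the rows at or above the limit.
hot_gpus() {
    local limit="$1"
    awk -F', *' -v limit="$limit" \
        '$2 + 0 >= limit { print "GPU " $1 ": " $2 " C, fan " $3 "%, " $4 }'
}

# Live use (requires the NVIDIA driver to be loaded):
#   nvidia-smi --query-gpu="$QUERY" --format=csv,noheader,nounits | hot_gpus 82
```

Running the live command in a loop (for example under watch) shows whether the drop to P8 coincides with the fan setting being applied or with the temperature limit being reached.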

Is there a working way to force the fan speeds to 100% (or 95%) on a headless server, so that temperatures stay lower while performance stays up? Or could this be a driver bug, or are we doing something wrong?

I have a post about this on the Ubuntu forum-br; take a look:

http://ubuntuforum-br.org/index.php/topic,117588.msg647417.html#msg647417

http://ubuntuforum-br.org/index.php/topic,117459.msg646837.html#msg646837

My Portuguese is “a bit” rusty. I hope you noticed the “headless” part of my post, i.e. no displays attached.

In addition, we have started to see some random/semi-random freezing of the system. The freezes appear to occur most frequently when starting or stopping GPU computations. MemTest86 found no errors in 24 hours of testing. The system has redundant power supplies (4x1.6 kW) and IPMI shows no problems there. The general system cooling also appears to be working fine.

To check whether the drivers are the problem, we started testing with Windows Server 2016. The system appears to be more stable under Windows, and the fan speed control works (using MSI Afterburner). I suppose the issue is driver related on Linux. (We may also have had some power-related issues - maybe… but that is still unverified.)

Short update - our strange stability issues appear to have been solved by installing VMware and Ubuntu 14.04.1 on top of that. Don’t ask why.

Also the fan control problem was solved by setting up a dummy display for each card at boot time so that they exist all the time. Now things appear to work more or less OK. And the current driver is 378.09 - dunno if that changed anything. But up and apparently stable now…

Final update: if you experience (power management related) stability issues with NVIDIA Pascal cards, be sure to check that the PCI bridges support the GPU cards. In our case, flashing the PLX EEPROM(s) following instructions from Supermicro support (and also the main BIOS, so that we could flash the PLX) fixed our stability issues…
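
As a starting point for that check, the PLX switches in the system can be listed with lspci. This is just a sketch: the helper name plx_bridges is made up, and it relies on PLX Technology's PCI vendor ID being 10b5. It only shows which bridges are present; whether their EEPROM contents suit the cards is a question for the board vendor.

```shell
#!/usr/bin/env bash
# Sketch: filter `lspci -nn` output down to PLX devices (vendor ID 10b5).

plx_bridges() {
    # Reads lspci -nn style lines on stdin; -F matches the ID literally.
    grep -F '[10b5:'
}

# Live use:
#   lspci -nn | plx_bridges
#   lspci -tv          # tree view, shows which GPUs hang off which bridge
```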

Hi, I want to ask a basic question: how do you set up a dummy X? What do you write in the xorg.conf?

I have three GPUs. Only one of them is used for display; the other two are for computing. I want to be able to control the fan speed of all three, but right now only the one driving the display has a controllable fan speed. I think I may need to set up a dummy X for the remaining two GPUs. Is that the way to do it?

Right now I have xserver-xorg-video-dummy installed, and I have three GPU Device sections, one Monitor, and one Screen in my xorg.conf. How do I write the other two dummy displays for the other two GPUs?

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Videocard0"
    BusID          "PCI:2:0:0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Device"
    Identifier     "Videocard1"
    BusID          "PCI:3:0:0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Device"
    Identifier     "Videocard2"
    BusID          "PCI:129:0:0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Videocard1"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Just an update, I think I’ve figured out a handy solution to my own question. From this website: https://foldingforum.org/viewtopic.php?f=16&t=25075

The following two commands make it possible to adjust the fan speed of multiple GPUs.

nvidia-xconfig --enable-all-gpus
nvidia-xconfig --cool-bits=4
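
For anyone who would rather write the sections by hand, the idea is that every GPU gets its own Device and Screen section, with Coolbits set on each Screen. The sketch below prints such pairs for the three BusIDs from my xorg.conf. The function name gpu_sections is made up, the exact layout that nvidia-xconfig writes may differ, and the AllowEmptyInitialConfiguration option is my assumption for cards with no monitor attached.

```shell
#!/usr/bin/env bash
# Sketch: print one Device/Screen pair per GPU so that every card gets an
# X screen with manual fan control (Coolbits value 4) enabled.

gpu_sections() {
    local index="$1"   # device/screen number
    local busid="$2"   # e.g. PCI:2:0:0, see `nvidia-xconfig --query-gpu-info`
    cat <<EOF
Section "Device"
    Identifier     "Videocard${index}"
    Driver         "nvidia"
    BusID          "${busid}"
EndSection

Section "Screen"
    Identifier     "Screen${index}"
    Device         "Videocard${index}"
    Option         "Coolbits" "4"
    Option         "AllowEmptyInitialConfiguration" "True"
EndSection

EOF
}

# BusIDs taken from the xorg.conf posted above.
i=0
for busid in PCI:2:0:0 PCI:3:0:0 PCI:129:0:0; do
    gpu_sections "$i" "$busid"
    i=$((i + 1))
done
```

Each generated Screen still has to be listed in the ServerLayout section, otherwise X will only start the first one.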

Hi,

We have the same problem with one 4028GR-TR and 8 Titan X (Pascal) cards.
Can you send us the PLX EEPROM flash?

Best regards