Titan X slower than CPU (TensorFlow), possible configuration issue

I just set up a Titan X machine to replace an AWS EC2 instance I’ve been using with TensorFlow. I noticed that running on this Titan X is actually dramatically slower than running on the CPU of the same box (32 cores/threads). Something is obviously wrong.

The only clue I could find is in the output of nvidia-smi -q, which shows:

HW Slowdown : Active

I can’t find much information on “HW Slowdown” beyond what the docs say.

The GPU was 100% utilized. I tried manually raising the fan speed to see whether it was a temperature issue, but even at around 42 C it was still slow.

==============NVSMI LOG==============

Timestamp                           : Tue Apr 12 00:32:40 2016
Driver Version                      : 361.42

Attached GPUs                       : 1
GPU 0000:84:00.0
    Product Name                    : GeForce GTX TITAN X
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0420116032145
    GPU UUID                        : GPU-cb55b961-e76a-15fc-2abc-412687da3242
    Minor Number                    : 0
    VBIOS Version                   : 84.00.45.00.90
    MultiGPU Board                  : No
    Board ID                        : 0x8400
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x84
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x17C210DE
        Bus Id                      : 0000:84:00.0
        Sub System Id               : 0x29923842
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 22000 KB/s
        Rx Throughput               : 61000 KB/s
    Fan Speed                       : 100 %
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 12287 MiB
        Used                        : 11763 MiB
        Free                        : 524 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 4 MiB
        Free                        : 252 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 100 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 56 C
        GPU Shutdown Temp           : 97 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 241.50 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 275.00 W
    Clocks
        Graphics                    : 1328 MHz
        SM                          : 1328 MHz
        Memory                      : 3304 MHz
        Video                       : 1227 MHz
    Applications Clocks
        Graphics                    : 1126 MHz
        Memory                      : 3505 MHz
    Default Applications Clocks
        Graphics                    : 1126 MHz
        Memory                      : 3505 MHz
    Max Clocks
        Graphics                    : 1519 MHz
        SM                          : 1519 MHz
        Memory                      : 3505 MHz
        Video                       : 1397 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes
        Process ID                  : 23088
            Type                    : C
            Name                    : python
            Used GPU Memory         : 11736 MiB

That doesn’t look to me like the clock is being throttled, and I’m not even sure what to look for from here. Any help would be appreciated.
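If you want to check the throttle state repeatedly rather than scanning the full log by eye, a small parser can pull out whichever “Clocks Throttle Reasons” entries are reported as Active. This is a minimal sketch, assuming the `nvidia-smi -q` text layout shown above (a `Clocks Throttle Reasons` header followed by more deeply indented `Name : Active/Not Active` lines); other driver versions may name or indent that section differently.

```python
def active_throttle_reasons(smi_text: str) -> list[str]:
    """Return the throttle reasons marked 'Active' in `nvidia-smi -q` output.

    Assumes the layout seen in this thread: a 'Clocks Throttle Reasons'
    header followed by more deeply indented 'Name : Active/Not Active' lines.
    """
    reasons = []
    header_indent = None  # indent of the header while we are inside the section
    for line in smi_text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped == "Clocks Throttle Reasons":
            header_indent = indent
            continue
        if header_indent is None:
            continue  # not inside the throttle section
        # The section ends at a blank line or a line indented no deeper than its header.
        if not stripped or indent <= header_indent:
            header_indent = None
            continue
        name, _, value = stripped.partition(":")
        if value.strip() == "Active":  # "Not Active" does not match
            reasons.append(name.strip())
    return reasons
```

Feed it the text from `subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout`; on the log above it returns `["HW Slowdown"]`.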

That’s rather unusual on a Titan X. What is the history of this GPU? Where did you buy it, and when?

I have never seen this “HW Slowdown”, but given that the instantaneous power reported by nvidia-smi (241.5 W) is very close to the “enforced power limit” of 250 W, it stands to reason that there may have been power spikes that exceeded the 250 W limit and thereby triggered a “HW slowdown”. The current power state of P2 would seem to indicate the GPU is in a lower-power state; at full performance I would expect it to be in the P0 state.
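The reasoning above (power draw within a few percent of the enforced limit, plus a P2 rather than P0 performance state) can be written down as a quick check. A minimal sketch: the 5% margin is an arbitrary threshold chosen for illustration, not an NVIDIA-defined value, and treating anything other than P0 as suspicious encodes the heuristic from this post, not a documented rule.

```python
def power_headroom(draw_w: float, enforced_limit_w: float) -> float:
    """Fraction of the enforced power limit left unused (0.0 = exactly at the cap)."""
    if enforced_limit_w <= 0:
        raise ValueError("enforced power limit must be positive")
    return 1.0 - draw_w / enforced_limit_w


def power_state_flags(draw_w: float, enforced_limit_w: float, pstate: str,
                      margin: float = 0.05) -> list[str]:
    """Flag the two symptoms discussed above.

    margin is an illustrative threshold (5% of the limit), not an NVIDIA value.
    """
    flags = []
    if power_headroom(draw_w, enforced_limit_w) < margin:
        flags.append("near enforced power limit")
    if pstate != "P0":
        flags.append(f"in {pstate} rather than P0")
    return flags


# Power Draw, Enforced Power Limit and Performance State from the first log above.
print(power_state_flags(241.50, 250.00, "P2"))
```

With the first log’s numbers both flags fire; with the 275 W limit from the later log (245.49 W drawn) only the P2 flag remains.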

Based on the nvidia-smi output, it seems you should be able to configure a higher power limit (up to 275 W) for this GPU to avoid this. Given the good cooling (only 56 deg C), that seems like a safe thing to do, assuming your PSU can handle the additional power draw. nvidia-smi has the switch --power-limit=[limit in watts] to set the limit.
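Raising the limit is a single nvidia-smi call, but the value has to stay within the Min/Max Power Limit range the log reports (150 W to 275 W here). A sketch that builds the command and checks the range first; the `gpu_index` parameter and the `dry_run` switch are illustrative additions, and actually changing the limit normally requires root privileges.

```python
import subprocess


def power_limit_command(watts: float, min_w: float, max_w: float,
                        gpu_index: int = 0) -> list[str]:
    """Build an nvidia-smi command that sets the board power limit.

    min_w/max_w are the 'Min/Max Power Limit' values from `nvidia-smi -q`;
    values outside that range are rejected before nvidia-smi is called.
    """
    if not min_w <= watts <= max_w:
        raise ValueError(f"{watts} W is outside the supported range {min_w}-{max_w} W")
    return ["nvidia-smi", "-i", str(gpu_index), f"--power-limit={watts:g}"]


def run(cmd: list[str], dry_run: bool = True) -> None:
    """Print the command, or execute it when dry_run is False (usually needs root)."""
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)


# Min/Max Power Limit as reported in the log above.
run(power_limit_command(275, min_w=150, max_w=275))
```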

I tried setting the power limit to its maximum of 275 W, setting PowerMizer to 0, 1, and 2, and forcing the fans to 100%.

I, too, was thinking the P2 power mode was possibly the culprit but I’m not sure how to alleviate that.

The card is an EVGA, bought brand new from Amazon. The server is a Dell T630 with dual 8-core Xeons (16 threads each, 32 threads total) and redundant 1100 W PSUs.

Here is my nvidia-smi -q output with the fans set to 100% and the power limit set to 275 W:

==============NVSMI LOG==============

Timestamp                           : Tue Apr 12 13:01:58 2016
Driver Version                      : 361.42

Attached GPUs                       : 1
GPU 0000:02:00.0
    Product Name                    : GeForce GTX TITAN X
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0420116032145
    GPU UUID                        : GPU-cb55b961-e76a-15fc-2abc-412687da3242
    Minor Number                    : 0
    VBIOS Version                   : 84.00.45.00.90
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x17C210DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x29923842
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 17000 KB/s
        Rx Throughput               : 63000 KB/s
    Fan Speed                       : 100 %
    Performance State               : P2
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 12287 MiB
        Used                        : 11763 MiB
        Free                        : 524 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 4 MiB
        Free                        : 252 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 99 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 45 C
        GPU Shutdown Temp           : 97 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 245.49 W
        Power Limit                 : 275.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 275.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 275.00 W
    Clocks
        Graphics                    : 1341 MHz
        SM                          : 1341 MHz
        Memory                      : 3304 MHz
        Video                       : 1234 MHz
    Applications Clocks
        Graphics                    : 1126 MHz
        Memory                      : 3505 MHz
    Default Applications Clocks
        Graphics                    : 1126 MHz
        Memory                      : 3505 MHz
    Max Clocks
        Graphics                    : 1519 MHz
        SM                          : 1519 MHz
        Memory                      : 3505 MHz
        Video                       : 1397 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes
        Process ID                  : 4606
            Type                    : C
            Name                    : python
            Used GPU Memory         : 11736 MiB

With the power limit increased, your GPU seems to have boosted clocks slightly higher than before (1341 MHz vs 1328 MHz), and the power consumption has increased a tad as well (245.5 W vs 241.5 W). That makes sense. But the “P2” state and “HW Slowdown: Active” do not make sense to me, given that the card seems to be running flat out based on the utilization, clocks, and power consumption reported by nvidia-smi.
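To put numbers on “slightly” and “a tad”, the relative change between the two logs can be computed directly. The values below are copied from the two nvidia-smi -q outputs in this thread (default 250 W limit vs. 275 W limit).

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (first log, second log) readings taken from this thread.
readings = {
    "SM clock (MHz)": (1328, 1341),
    "Power draw (W)": (241.50, 245.49),
}

for name, (before, after) in readings.items():
    print(f"{name}: {before} -> {after} ({percent_change(before, after):+.1f}%)")
```

This gives roughly +1.0% for the SM clock and +1.7% for power draw, which fits the small extra boost described above.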

Power state management involves software. I wonder whether it may have gotten into a weird state. Have you tried power-cycling the machine? Also, you might want to check that you have the latest official EVGA VBIOS for this card, in case there have been firmware bugs (unlikely at this stage, I would think). Other than that, I am out of ideas on what to try.
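One way to see whether the card ever leaves P2, or whether HW Slowdown toggles with load, is to log a few fields over time instead of taking single `-q` snapshots. A sketch using nvidia-smi’s `--query-gpu` CSV mode with `-l` for repeated sampling; the field names are assumptions to verify with `nvidia-smi --help-query-gpu`, since availability and naming vary by driver version. Values are kept as strings so that entries such as “[Not Supported]” do not break parsing.

```python
import subprocess
from typing import Optional

# Field names are assumptions: verify them with `nvidia-smi --help-query-gpu`.
FIELDS = ["timestamp", "pstate", "clocks.sm", "power.draw", "temperature.gpu",
          "clocks_throttle_reasons.hw_slowdown"]


def query_command(loop_seconds: Optional[int] = None) -> list[str]:
    """Build the nvidia-smi query; with loop_seconds it repeats every N seconds."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    if loop_seconds:
        cmd += ["-l", str(loop_seconds)]
    return cmd


def parse_row(line: str) -> dict:
    """Turn one CSV line produced by query_command() into a dict keyed by field."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
    return dict(zip(FIELDS, values))


def watch(loop_seconds: int = 1) -> None:
    """Print only rows where the GPU is not in P0 or HW Slowdown is active."""
    with subprocess.Popen(query_command(loop_seconds), stdout=subprocess.PIPE,
                          text=True) as proc:
        for line in proc.stdout:
            if not line.strip():
                continue
            row = parse_row(line)
            if (row["pstate"] != "P0"
                    or row["clocks_throttle_reasons.hw_slowdown"] == "Active"):
                print(row)
```

Running `watch()` in a second terminal while the training job starts and stops would show whether the P2/HW Slowdown combination appears only under load.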

Strangely, the clocks on this GTX Titan X seem to be much higher than the boost clocks stated on the EVGA website for the fastest model of Titan X they sell (http://www.evga.com/Products/ProductList.aspx?type=0&family=GeForce+TITAN+Series+Family&chipset=GTX+TITAN+X), which is confusing if not suspicious.

I’ve definitely power-cycled the machine quite a few times, and I updated the drivers. I haven’t yet tried different versions of cuDNN or CUDA, or different builds of TensorFlow; I’m going to focus on that now and see what happens.

So, I downgraded CUDA from 7.5 to 7.0 and cuDNN from 5.0 to 4.0, and installed the driver packaged with CUDA 7.0. That fixed it.
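After a downgrade like this, it can be worth confirming which CUDA runtime and cuDNN library a process actually picks up, since several toolkits can be installed side by side. A Linux-oriented sketch using `ctypes`; it assumes the libraries are discoverable as `cudart` and `cudnn` by `ctypes.util.find_library`, and the cuDNN decoding follows the v4/v5-era scheme (newer cuDNN releases changed the encoding). Note that `cudaDriverGetVersion` reports the newest CUDA version the installed driver supports, not a driver package number such as 361.42.

```python
import ctypes
import ctypes.util


def decode_cuda_version(v: int) -> str:
    """CUDA encodes versions as 1000*major + 10*minor, e.g. 7050 -> '7.5'."""
    return f"{v // 1000}.{(v % 1000) // 10}"


def decode_cudnn_version(v: int) -> str:
    """cuDNN v4/v5 use 1000*major + 100*minor + patch, e.g. 5005 -> '5.0.5'."""
    return f"{v // 1000}.{(v % 1000) // 100}.{v % 100}"


def _load(name: str):
    """Load a shared library by short name, or return None if it is not found."""
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None


def cuda_versions():
    """Return (driver-supported CUDA, runtime CUDA), None for anything unavailable."""
    lib = _load("cudart")
    if lib is None:
        return None, None
    drv, rt = ctypes.c_int(), ctypes.c_int()
    driver = runtime = None
    if lib.cudaDriverGetVersion(ctypes.byref(drv)) == 0:  # 0 == cudaSuccess
        driver = decode_cuda_version(drv.value)
    if lib.cudaRuntimeGetVersion(ctypes.byref(rt)) == 0:
        runtime = decode_cuda_version(rt.value)
    return driver, runtime


def cudnn_version():
    """Return the loaded cuDNN version string, or None if cuDNN is not found."""
    lib = _load("cudnn")
    if lib is None:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


if __name__ == "__main__":
    print("CUDA (driver, runtime):", cuda_versions())
    print("cuDNN:", cudnn_version())
```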

Good to hear that the issue turned out to be fixable this way, although in general I would not recommend CUDA 7.0; that version had too many bugs. I upgraded straight from CUDA 6.5 to CUDA 7.5 because of that.

Given the weird power-state handling, the most likely source of this trouble would appear to be the driver package; I can’t imagine how either CUDA or cuDNN would play into it. So if you have sufficient time on your hands, you might want to cycle through all recent driver packages to see whether one of them does not exhibit this issue and also allows you to run CUDA 7.5.

I found an open issue on the TensorFlow GitHub where someone with a Titan X was seeing slow execution. I believe this might be more of an odd combination of incompatibilities between TensorFlow, CUDA 7.5, and perhaps the drivers, or at least the card.
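Since the open question is whether TensorFlow, CUDA, or the driver is at fault, a small timing harness run with the same workload on CPU and GPU gives a concrete number to compare across version combinations. A minimal stdlib sketch; what goes into `fn` is up to you, and for a GPU workload it must block until the result is ready (for example by copying it back to the host), because GPU work is launched asynchronously.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> float:
    """Median wall-clock seconds per call of fn, measured after warm-up calls.

    Warm-up absorbs one-off costs (e.g. GPU context creation, memory allocation).
    fn must block until its result is ready; otherwise only an asynchronous
    GPU launch is timed instead of the actual computation.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # Placeholder CPU-only workload; substitute the CPU and GPU variants of your step.
    print(f"{benchmark(lambda: sum(i * i for i in range(100_000))):.6f} s per call")
```

Assuming a TensorFlow 2.x install (not the 2016 release discussed in this thread), `fn` could be `lambda: tf.linalg.matmul(a, b).numpy()`, with `benchmark` called once inside `with tf.device("/CPU:0"):` and once inside `with tf.device("/GPU:0"):`; `.numpy()` forces the result back to the host.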

I’ll have to play around with CUDA, cuDNN, and driver versions to see. When I first saw this issue I was actually running the drivers prepackaged with CUDA 7.5, and I upgraded the drivers to see whether that fixed it. I’m now leaning more towards this being a TensorFlow issue.
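Cycling through driver, CUDA, and cuDNN versions is easier to keep track of as an explicit test matrix. A sketch with `itertools.product`, using only the versions mentioned in this thread; `cuda7.0-bundled` is a hypothetical label for the driver shipped with the CUDA 7.0 installer, and the driver-to-CUDA mapping is an assumption to check against NVIDIA’s release notes. The only rule encoded is that a driver cannot run a CUDA runtime newer than the one it supports; cuDNN builds also target specific CUDA versions, so pairing rules from the cuDNN release notes would still need to be added.

```python
import itertools

# Newest CUDA version each driver supports (assumed). "361.42" is the driver from
# the logs above; "cuda7.0-bundled" is a hypothetical label for the CUDA 7.0 driver.
DRIVERS = {"361.42": "7.5", "cuda7.0-bundled": "7.0"}
CUDA_VERSIONS = ["7.0", "7.5"]
CUDNN_VERSIONS = ["4.0", "5.0"]


def version_key(version: str) -> tuple[int, ...]:
    """'7.5' -> (7, 5), so versions compare numerically rather than as strings."""
    return tuple(int(part) for part in version.split("."))


def build_matrix():
    """Yield (driver, cuda, cudnn) combinations worth testing."""
    for driver, cuda, cudnn in itertools.product(DRIVERS, CUDA_VERSIONS, CUDNN_VERSIONS):
        # A driver cannot run a CUDA runtime newer than the newest one it supports.
        if version_key(cuda) > version_key(DRIVERS[driver]):
            continue
        # Hypothetical hook: add cuDNN/CUDA pairing rules from the release notes here.
        yield driver, cuda, cudnn


for combo in build_matrix():
    print(*combo)
```

Each combination would then get the same benchmark run, and TensorFlow may need rebuilding for any CUDA version the prebuilt binaries were not compiled against.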

Having the exact same issue. I tried multiple different driver versions, but that didn’t help; the behaviour remained. How could that possibly be a problem with CUDA + cuDNN?! (Downgrading to CUDA 7.0 solved the problem for me previously, but that requires me to recompile TensorFlow, and I’d like to use the precompiled binaries.)