GPU performance suddenly drops down twice during learning

I have a few GPUs, but only one of them shows this strange behaviour.
The performance state is still P2, utilization is still 100%, and even the temperature is the same.
dmesg shows no errors.
WTF?

Do all GPUs have the same specification (same brand, same SKU, same VBIOS version)? If not you would be comparing apples and oranges. What type of GPUs are these?

I have no idea what “drops down twice” means.

[1] There were two time periods during which performance was lower? If so, how long did they last, and what was the total runtime of the application?

[2] The performance was cut in half? How was performance measured? Was performance low throughout the application run?

As far as I am aware, a performance state of P2 is normal for certain newer consumer GPUs when executing compute tasks.

The most common reason for unexpected performance drops is throttling, either because of temperature or because of power. I am using actively cooled cards, and my observation is that power throttling is the more common type: monitoring with GPU-Z seems to indicate that power throttling activates at 90% of the GPU’s power rating and kicks in fast enough to keep power below 95% of the power rating. I have yet to see temperature throttling on my GPU: the fan speed increases, sometimes to 100%, to keep the GPU under the temperature threshold. It could be a different story if your ambient temperature is regularly in excess of 85 degrees Fahrenheit.

NOTE: Every type of GPU has different thresholds for power and temperature at which throttling kicks in. Sometimes these are user adjustable, sometimes not. Also, even identical GPU models that are identically configured may show different throttling behavior: the sensors (power, temperature) are not calibrated and are usually accurate only to about +/-5%; the hardware itself (including all the chips) has manufacturing variations that cause differences in power draw and thus heat generation; and the environment in which the GPU is operating may not be exactly the same, e.g. some spots in the enclosure can be warmer, others colder.

Hi, thanks for your reply.
Both cards are from the same vendor and have the same VBIOS version.
Unfortunately I don’t understand what SKU means, but I hope it is also the same.
Both cards are connected through PCIe x8 slots.

The point is that during the training process I print the average number of batches per second, and this number can suddenly drop from 5.1 to 2.6 batches per second. Restarting the process does not help; only a reboot does.

You mentioned power throttling. How can I check for it? Maybe with nvidia-smi or something like that?

Here is the nvidia-smi -q output:

==============NVSMI LOG==============

Timestamp                           : Sat Nov 10 10:04:51 2018
Driver Version                      : 390.48

Attached GPUs                       : 2
GPU 00000000:17:00.0
    Product Name                    : GeForce GTX 1080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-e80d5214-f7b0-a41b-c850-54bddaef9a34
    Minor Number                    : 0
    VBIOS Version                   : 86.02.39.00.9E
    MultiGPU Board                  : No
    Board ID                        : 0x1700
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x17
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0610DE
        Bus Id                      : 00000000:17:00.0
        Sub System Id               : 0x37511458
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 8x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 0 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11178 MiB
        Used                        : 2452 MiB
        Free                        : 8726 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 5 MiB
        Free                        : 251 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 53 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 21.56 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 375.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2037 MHz
        SM                          : 2037 MHz
        Memory                      : 5616 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 1179
            Type                    : C
            Name                    : /usr/bin/python3
            Used GPU Memory         : 189 MiB
        Process ID                  : 47265
            Type                    : C
            Name                    : /home/username/torch/install/bin/luajit
            Used GPU Memory         : 1551 MiB
        Process ID                  : 47300
            Type                    : C
            Name                    : /home/username/torch/install/bin/luajit
            Used GPU Memory         : 291 MiB
        Process ID                  : 47797
            Type                    : C
            Name                    : /usr/bin/python3
            Used GPU Memory         : 189 MiB
        Process ID                  : 47982
            Type                    : C
            Name                    : /usr/bin/python3
            Used GPU Memory         : 189 MiB

GPU 00000000:65:00.0
    Product Name                    : GeForce GTX 1080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Enabled
    Display Active                  : Enabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-f8b9e3a1-f328-2790-8224-ca0a6db01c46
    Minor Number                    : 1
    VBIOS Version                   : 86.02.39.00.9E
    MultiGPU Board                  : No
    Board ID                        : 0x6500
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x65
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0610DE
        Bus Id                      : 00000000:65:00.0
        Sub System Id               : 0x37511458
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 33 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11175 MiB
        Used                        : 2083 MiB
        Free                        : 9092 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 5 MiB
        Free                        : 251 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 1 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 58 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 23.81 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 375.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2037 MHz
        SM                          : 2037 MHz
        Memory                      : 5616 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 1112
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 20 MiB
        Process ID                  : 1166
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 12 MiB
        Process ID                  : 47265
            Type                    : C
            Name                    : /home/username/torch/install/bin/luajit
            Used GPU Memory         : 291 MiB
        Process ID                  : 47300
            Type                    : C
            Name                    : /home/username/torch/install/bin/luajit
            Used GPU Memory         : 1715 MiB

Yes, nvidia-smi can tell you whether throttling is taking place, and what the reason for it is. My preferred tool on Windows is GPU-Z which graphically displays GPU sensor data including throttling, not sure whether there is something like that for Linux.
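As a rough illustration of checking this from a script, here is a minimal Python sketch that builds an nvidia-smi query and parses its CSV output. The field names are the ones listed by `nvidia-smi --help-query-gpu`; which of them are available can depend on the driver version, so treat the list as an assumption to verify on your system. The helper names (`build_command`, `parse_rows`, `active_throttle_reasons`, `sample`) are made up for this example.

```python
import csv
import io
import subprocess

# Query fields, named as in `nvidia-smi --help-query-gpu`.
# Assumption: all of these are supported by the installed driver.
FIELDS = [
    "timestamp",
    "uuid",
    "pstate",
    "clocks.sm",
    "power.draw",
    "temperature.gpu",
    "pcie.link.width.current",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def build_command(fields=FIELDS):
    """Return the nvidia-smi command line that prints one CSV row per GPU."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv,noheader"]


def parse_rows(text, fields=FIELDS):
    """Turn the CSV text printed by nvidia-smi into a list of dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(fields, (cell.strip() for cell in row))) for row in reader if row]


def active_throttle_reasons(row):
    """List the queried throttle reasons that nvidia-smi reports as Active."""
    prefix = "clocks_throttle_reasons."
    return [key[len(prefix):] for key, value in row.items()
            if key.startswith(prefix) and value == "Active"]


def sample():
    """Run nvidia-smi once and return the parsed rows (requires the NVIDIA driver)."""
    result = subprocess.run(build_command(), capture_output=True, text=True, check=True)
    return parse_rows(result.stdout)
```

Calling `sample()` while the training job is running and printing `active_throttle_reasons(row)` for each row should show whether a power cap or a thermal slowdown is active on the slow card. For a quick manual look, `nvidia-smi -q -d PERFORMANCE` prints the same "Clocks Throttle Reasons" section that appears in the full `-q` dump.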

The nvidia-smi output you posted shows one GPU using an 8x link, the other a 16x link (“Link Width / Current”). You might want to double-check whether the “slow” GPU is the one with the 8x link and the “fast” one the one with the 16x link.

Other than that I don’t see anything that hints at a performance difference.

Yes, the cards currently have different connections, but even if I swap them, it doesn’t help.
And again, the performance drop occurs suddenly during the training process.

Here is a log example:

[180522 05:21:45] On #64000, 5.05 (iter/s), grad(35/37)=85.42/54.9, dScale=0, param size=125.646, length=191.3, cosine=0.0007, train loss/acc=0.317/0.953/0.852/0.668/0.439, lr=2.1e-05
[180522 05:25:03] On #65000, 5.04 (iter/s), grad(35/37)=83.03/50.45, dScale=0, param size=125.645, length=181.568, cosine=0.0008, train loss/acc=0.3449/0.952/0.853/0.666/0.437, lr=2.1e-05
[180522 05:28:21] On #66000, 5.04 (iter/s), grad(35/37)=85.13/53.31, dScale=0, param size=125.643, length=192.365, cosine=0.0007, train loss/acc=0.3069/0.953/0.847/0.667/0.442, lr=2.1e-05
[180522 05:31:39] On #67000, 5.04 (iter/s), grad(35/37)=86.82/55.09, dScale=0, param size=125.642, length=191.005, cosine=0.0007, train loss/acc=0.3327/0.952/0.851/0.668/0.439, lr=2.1e-05
[180522 05:34:58] On #68000, 5.05 (iter/s), grad(35/37)=90.06/54.91, dScale=0, param size=125.641, length=191.749, cosine=0.0008, train loss/acc=0.3147/0.953/0.852/0.669/0.434, lr=2.1e-05
[180522 05:38:16] On #69000, 5.05 (iter/s), grad(35/37)=85.26/55.33, dScale=0, param size=125.64, length=198.492, cosine=0.0008, train loss/acc=0.3229/0.953/0.853/0.668/0.434, lr=2.1e-05
[180522 05:41:34] On #70000, 5.04 (iter/s), grad(35/37)=90.59/56.58, dScale=0, param size=125.639, length=182.06, cosine=0.0008, train loss/acc=0.296/0.957/0.855/0.672/0.442, lr=2.1e-05
[180522 05:44:52] On #71000, 5.04 (iter/s), grad(35/37)=78.7/48.23, dScale=0, param size=125.637, length=199.971, cosine=0.0008, train loss/acc=0.3038/0.954/0.851/0.666/0.431, lr=2.1e-05
[180522 05:48:10] On #72000, 5.04 (iter/s), grad(35/37)=81.19/52.77, dScale=0, param size=125.636, length=187.55, cosine=0.0008, train loss/acc=0.3278/0.952/0.851/0.668/0.44, lr=2.1e-05
[180522 05:51:28] On #73000, 5.04 (iter/s), grad(35/37)=92.3/57.81, dScale=0, param size=125.635, length=186.046, cosine=0.0007, train loss/acc=0.3118/0.952/0.851/0.663/0.434, lr=2.1e-05
[180522 05:54:46] On #74000, 5.04 (iter/s), grad(35/37)=90.86/58.24, dScale=0, param size=125.634, length=195.341, cosine=0.0007, train loss/acc=0.3141/0.953/0.844/0.667/0.439, lr=2.1e-05
[180522 05:59:05] On #75000, 3.87 (iter/s), grad(35/37)=74.81/46.29, dScale=0, param size=125.632, length=202.407, cosine=0.0007, train loss/acc=0.3057/0.954/0.851/0.661/0.44, lr=2.1e-05
[180522 06:04:03] On #76000, 3.34 (iter/s), grad(35/37)=68.14/43.71, dScale=0, param size=125.631, length=189.243, cosine=0.0007, train loss/acc=0.2879/0.952/0.855/0.665/0.437, lr=2.1e-05
[180522 06:09:02] On #77000, 3.34 (iter/s), grad(35/37)=86.72/53.36, dScale=0, param size=125.63, length=197.349, cosine=0.0007, train loss/acc=0.3149/0.955/0.853/0.669/0.444, lr=2.1e-05
[180522 06:14:01] On #78000, 3.34 (iter/s), grad(35/37)=88.79/55.11, dScale=0, param size=125.629, length=185.043, cosine=0.0007, train loss/acc=0.3473/0.951/0.849/0.664/0.436, lr=2.1e-05
[180522 06:18:59] On #79000, 3.34 (iter/s), grad(35/37)=88.85/54.61, dScale=0, param size=125.628, length=194.873, cosine=0.0007, train loss/acc=0.3121/0.952/0.853/0.671/0.436, lr=2.1e-05
[180522 06:23:58] On #80000, 3.35 (iter/s), grad(35/37)=86.09/53.88, dScale=0, param size=125.626, length=202.949, cosine=0.0008, train loss/acc=0.3306/0.954/0.847/0.663/0.432, lr=2.1e-05

The question we need an answer to is: Does the performance loss correlate with the link width? If the answer is yes, you need to figure out why the link widths are different (BIOS setup, mechanical or electrical issues with the slot, etc). If the answer is no, the link width is a red herring and the root cause is elsewhere.
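To correlate the slowdown with anything else, it helps to pin down when it happens. Below is a minimal Python sketch that scans a training log in the format posted above and reports the intervals where the iteration rate falls. It assumes the bracketed timestamp is YYMMDD HH:MM:SS, and the 10% drop threshold is an arbitrary choice; the function names are made up for this example.

```python
import re
from datetime import datetime

# Matches the start of lines such as:
# [180522 05:59:05] On #75000, 3.87 (iter/s), ...
LINE_RE = re.compile(r"\[(\d{6} \d{2}:\d{2}:\d{2})\] On #(\d+), ([\d.]+) \(iter/s\)")


def parse_log(lines):
    """Return (timestamp, iteration, iter_per_s) tuples for matching lines."""
    records = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            # Assumption: the timestamp is YYMMDD HH:MM:SS.
            stamp = datetime.strptime(match.group(1), "%y%m%d %H:%M:%S")
            records.append((stamp, int(match.group(2)), float(match.group(3))))
    return records


def find_drops(records, threshold=0.9):
    """Return (previous, current) pairs where the rate fell below threshold * previous rate."""
    drops = []
    for prev, cur in zip(records, records[1:]):
        if cur[2] < threshold * prev[2]:
            drops.append((prev, cur))
    return drops
```

Applied to the log above, the first flagged interval is the one between #74000 (05:54:46) and #75000 (05:59:05), where the rate goes from 5.04 to 3.87 iter/s before settling at about 3.34. That time window is where the nvidia-smi readings should be compared.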

I am not familiar with your application and the benchmark, so I can’t easily come up with other hypotheses to explore. Is this a system with dual CPUs by any chance? If so, make sure to use numactl or a similar tool to bind CPU and memory, such that each GPU talks to the “close” CPU.

I have only one CPU in the system.
I already checked that even swapping the cards changes nothing in their behaviour: the problem is tied to the GPU UUID, not to the PCIe expansion slot.

If you physically swapped the cards and the “slowness” follows the card, the issue is with the card. When you monitor the cards with nvidia-smi during the benchmark run, do you notice anything change at the point where performance drops?

This could include the PCIe link width, because as far as I know the operational mode is negotiable between the GPU and the motherboard (within the maximum capabilities reported by nvidia-smi). If you examine the slot connector on the GPU carefully, is there any evidence that it is worn, dirty, or oily? I hypothesize that the signal integrity may be marginal, causing the PCIe link width to be reduced after the app has been running for some time.
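One way to test this hypothesis is to record GPU state periodically during a long run and look for the sample at which something changes. The Python sketch below assumes each sample is a dict keyed by nvidia-smi query field names (for example `pcie.link.width.current` and `pstate`); how the samples are collected is left open, and `find_changes` is a made-up helper name.

```python
# Fields to watch between consecutive samples.
# Assumption: the samples use nvidia-smi query field names as keys.
WATCHED = ("pcie.link.width.current", "pstate")


def find_changes(samples, watched=WATCHED):
    """Return (index, field, old, new) for every watched field that differs
    from its value in the previous sample."""
    events = []
    for i in range(1, len(samples)):
        for field in watched:
            old = samples[i - 1].get(field)
            new = samples[i].get(field)
            if old != new:
                events.append((i, field, old, new))
    return events


# Illustrative, hand-written samples (not measured data):
example = [
    {"pcie.link.width.current": "16", "pstate": "P2"},
    {"pcie.link.width.current": "8", "pstate": "P2"},
]
print(find_changes(example))
```

The PCIe link generation is deliberately left out of the watched fields: in the idle output posted earlier both cards report a current generation of 1 against a maximum of 3, which looks like ordinary power saving rather than a fault. A change in link width while the GPU stays under load would be the more interesting event.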

Electronics can be damaged by static discharge during the handling of the hardware, which is why a conductive wrist strap is recommended when performing such work. It is possible that your slow GPU has suffered some sort of damage, but such things are pretty much impossible to diagnose remotely.

In your experiments, how much time typically elapses between the start of the application and the point where the GPU slows down?

By the way, what is the power rating of your power supply (PSU)? Are you using any 6-pin to 8-pin converters in the power cables going to the GPU? Any Y-splitters?

It is quite difficult to catch this moment. I checked the kernel log and dmesg, and both are clean at the moment.

What do you mean by power rating?
I use the 8-pin connectors from the PSU package, with no additional converters of any type.
Usually it takes from a few hours to a few days of 100% load before the drop.

PSUs have a wattage printed on them: A 600W PSU, a 1000W PSU, a 1600W PSU. This “size” should be prominently printed on the box the PSU came in, it should also be printed on the label glued to the PSU.

Yours should be rated for 1000W if you want rock solid operation. This is based on a rule of thumb that says the rated wattage of all system components should not exceed 60% of the PSU power rating. You have 250W x2 from the GPU + maybe 120W for a single CPU motherboard+peripherals (it could be a bit more if you have tons of DRAM or a really high-end CPU). So 620W total for the system → 1000W PSU. The 60% rule provides significant headroom to absorb short-term spikes in power draw from either CPU or GPU.
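The arithmetic behind that rule of thumb, as a small Python sketch. The component wattages are the ones quoted above; the function name and the 0.6 default are only there to illustrate the 60% rule.

```python
def recommended_psu_watts(component_watts, max_load_fraction=0.6):
    """Return (total component wattage, minimum PSU rating under the rule)."""
    total = sum(component_watts)
    return total, total / max_load_fraction


# Two 250 W GPUs plus roughly 120 W for CPU, motherboard and peripherals.
total, psu = recommended_psu_watts([250, 250, 120])
print(total, round(psu))  # 620 1033
```

So the rule gives roughly 1033 W, which rounds to the 1000 W class of PSU.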

I would recommend a PSU that complies with 80 PLUS Platinum specs, with 80 PLUS Gold as the minimum standard for a workstation that works under full load for days at a time. If you run that machine continuously, you should make back the additional expense of the Platinum PSU vs Gold in a couple of years through electrical power savings (this will depend on PSU prices and cost of electricity in your location).

Does the system operate in an environment where there could be significant vibrations, a ship or a factory floor, for example? That could impact the reliability of PCIe slot connectors, for example.

Hours to days to “failure” would seem to exclude a thermal issue, as those should manifest after 30 minutes at most. So I am suspecting some marginal electrics, either reliability of power supply, issues with signal integrity or damage to chips.

Oh wait, you mean power :^)
My PSU is rated at 1.5 kW, which seems quite enough for two 1080 Ti cards.

Yes, a 1500 W PSU should be more than enough. So the power supply should not be an issue.

I wonder whether electromagnetic interference could be an issue, with one GPU more susceptible than the other. Any large electromagnetic devices nearby (large electrical motors, MRI machines, …)? I assume the system is running in a proper enclosure consisting (mostly) of sheet metal?

[Sorry, need to stop here, falling asleep at the keyboard]