I have a few GPUs, but only one shows this strange behaviour.
The performance state is still P2, utilization is still 100%, and even the temperature is the same.
dmesg shows no errors.
WTF?
Do all GPUs have the same specification (same brand, same SKU, same VBIOS version)? If not, you would be comparing apples and oranges. What type of GPUs are these?
I have no idea what this means.
[1] There were two time periods during which performance was lower? If so, how long did they last, and what was the total runtime of the application?
[2] The performance was cut in half? How was performance measured? Was performance low throughout the application run?
As far as I am aware, this is normal for certain newer consumer GPUs when executing compute tasks.
The most common reason for unexpected performance drops is throttling, either because of temperature or because of power. I am using actively cooled cards, and my observation is that power throttling is the more common type (monitoring with GPU-Z seems to indicate that power throttling activates at 90% of the GPU’s power rating, and kicks in fast enough to keep power below 95% of the rating). I have yet to see temperature throttling on my GPU: the fan speed increases, sometimes to 100%, to keep the GPU under the temperature threshold. It could be a different story if your ambient temperature is regularly in excess of 85 degrees Fahrenheit.
NOTE: Every type of GPU has different power and temperature thresholds at which throttling kicks in; sometimes these are user adjustable, sometimes not. Also, even identical GPU models that are identically configured may show different throttling behavior: the sensors (power, temperature) are not calibrated and are usually accurate only to about +/-5%; the hardware itself (including all the chips) has manufacturing variations, causing differences in power draw and thus heat generation; and the environment in which the GPU operates may not be exactly the same, e.g. some spots in the enclosure can be warmer, others colder.
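If you want to see what the thresholds are on your particular boards, they can be read out programmatically. Here is a minimal sketch in Python, assuming the pynvml bindings (the nvidia-ml-py package) are installed; it just prints each GPU’s enforced power limit, its adjustable range, and the thermal slowdown/shutdown thresholds (the same numbers nvidia-smi -q reports):

# Sketch: dump the throttling-relevant limits of every GPU (assumes pynvml / nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0        # mW -> W
        lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)   # adjustable range, mW
        slow = pynvml.nvmlDeviceGetTemperatureThreshold(
            h, pynvml.NVML_TEMPERATURE_THRESHOLD_SLOWDOWN)                # deg C
        shut = pynvml.nvmlDeviceGetTemperatureThreshold(
            h, pynvml.NVML_TEMPERATURE_THRESHOLD_SHUTDOWN)                # deg C
        print(f"GPU {i}: power limit {limit:.0f} W "
              f"(adjustable {lo/1000:.0f}-{hi/1000:.0f} W), "
              f"slowdown at {slow} C, shutdown at {shut} C")
finally:
    pynvml.nvmlShutdown()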
Hi, thanks for your reply.
Both cards are from the same vendor, with the same VBIOS version.
Unfortunately I don’t know what SKU means, but I hope it is also the same.
Both cards are connected through PCIe x8 slots.
The point is that during the training process I log the average number of batches per second, and this number can suddenly drop from 5.1 to 2.6 batches per second. Restarting the process does not help - only a reboot does.
You mentioned power throttling - how can I check for it? Maybe with nvidia-smi or something like that?
Here is the nvidia-smi -q output:
==============NVSMI LOG==============
Timestamp : Sat Nov 10 10:04:51 2018
Driver Version : 390.48
Attached GPUs : 2
GPU 00000000:17:00.0
Product Name : GeForce GTX 1080 Ti
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-e80d5214-f7b0-a41b-c850-54bddaef9a34
Minor Number : 0
VBIOS Version : 86.02.39.00.9E
MultiGPU Board : No
Board ID : 0x1700
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x17
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0610DE
Bus Id : 00000000:17:00.0
Sub System Id : 0x37511458
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11178 MiB
Used : 2452 MiB
Free : 8726 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 53 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 21.56 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 375.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2037 MHz
SM : 2037 MHz
Memory : 5616 MHz
Video : 1620 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1179
Type : C
Name : /usr/bin/python3
Used GPU Memory : 189 MiB
Process ID : 47265
Type : C
Name : /home/username/torch/install/bin/luajit
Used GPU Memory : 1551 MiB
Process ID : 47300
Type : C
Name : /home/username/torch/install/bin/luajit
Used GPU Memory : 291 MiB
Process ID : 47797
Type : C
Name : /usr/bin/python3
Used GPU Memory : 189 MiB
Process ID : 47982
Type : C
Name : /usr/bin/python3
Used GPU Memory : 189 MiB
GPU 00000000:65:00.0
Product Name : GeForce GTX 1080 Ti
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-f8b9e3a1-f328-2790-8224-ca0a6db01c46
Minor Number : 1
VBIOS Version : 86.02.39.00.9E
MultiGPU Board : No
Board ID : 0x6500
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x65
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0610DE
Bus Id : 00000000:65:00.0
Sub System Id : 0x37511458
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 33 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11175 MiB
Used : 2083 MiB
Free : 9092 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 58 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 23.81 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 375.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2037 MHz
SM : 2037 MHz
Memory : 5616 MHz
Video : 1620 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1112
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 20 MiB
Process ID : 1166
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 12 MiB
Process ID : 47265
Type : C
Name : /home/username/torch/install/bin/luajit
Used GPU Memory : 291 MiB
Process ID : 47300
Type : C
Name : /home/username/torch/install/bin/luajit
Used GPU Memory : 1715 MiB
Yes, nvidia-smi can tell you whether throttling is taking place, and what the reason for it is. My preferred tool on Windows is GPU-Z which graphically displays GPU sensor data including throttling, not sure whether there is something like that for Linux.
The nvidia-smi output you posted shows one GPU using an 8x link, the other a 16x link (“Link Width / Current”). You might want to double-check whether the “slow” GPU is the one with the 8x link and the “fast” GPU the one with the 16x link.
Other than that I don’t see anything that hints at a performance difference.
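If you want to keep an eye on exactly those two things without scrolling through the full -q dump, here is a rough sketch (Python, assuming the pynvml bindings are installed; the constant names are looked up defensively in case an older binding is missing some of them) that prints the current vs. maximum PCIe link and the decoded throttle reasons for each GPU:

# Sketch: one-shot report of PCIe link state and decoded throttle reasons per GPU.
import pynvml

REASONS = [
    "GpuIdle", "ApplicationsClocksSetting", "SwPowerCap", "HwSlowdown",
    "SyncBoost", "SwThermalSlowdown", "HwThermalSlowdown", "HwPowerBrakeSlowdown",
]

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        cur_w = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
        max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
        cur_g = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
        max_g = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
        mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)   # bitmask
        active = [r for r in REASONS
                  if mask & getattr(pynvml, "nvmlClocksThrottleReason" + r, 0)]
        print(f"GPU {i}: PCIe gen {cur_g}/{max_g}, width x{cur_w}/x{max_w}, "
              f"throttle reasons: {active or ['none']}")
finally:
    pynvml.nvmlShutdown()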
Yes, right now the cards have different link widths, but even if I swap them it doesn’t help.
And again, the performance drop occurs suddenly during the training process.
Here is a log example:
[180522 05:21:45] On #64000, 5.05 (iter/s), grad(35/37)=85.42/54.9, dScale=0, param size=125.646, length=191.3, cosine=0.0007, train loss/acc=0.317/0.953/0.852/0.668/0.439, lr=2.1e-05
[180522 05:25:03] On #65000, 5.04 (iter/s), grad(35/37)=83.03/50.45, dScale=0, param size=125.645, length=181.568, cosine=0.0008, train loss/acc=0.3449/0.952/0.853/0.666/0.437, lr=2.1e-05
[180522 05:28:21] On #66000, 5.04 (iter/s), grad(35/37)=85.13/53.31, dScale=0, param size=125.643, length=192.365, cosine=0.0007, train loss/acc=0.3069/0.953/0.847/0.667/0.442, lr=2.1e-05
[180522 05:31:39] On #67000, 5.04 (iter/s), grad(35/37)=86.82/55.09, dScale=0, param size=125.642, length=191.005, cosine=0.0007, train loss/acc=0.3327/0.952/0.851/0.668/0.439, lr=2.1e-05
[180522 05:34:58] On #68000, 5.05 (iter/s), grad(35/37)=90.06/54.91, dScale=0, param size=125.641, length=191.749, cosine=0.0008, train loss/acc=0.3147/0.953/0.852/0.669/0.434, lr=2.1e-05
[180522 05:38:16] On #69000, 5.05 (iter/s), grad(35/37)=85.26/55.33, dScale=0, param size=125.64, length=198.492, cosine=0.0008, train loss/acc=0.3229/0.953/0.853/0.668/0.434, lr=2.1e-05
[180522 05:41:34] On #70000, 5.04 (iter/s), grad(35/37)=90.59/56.58, dScale=0, param size=125.639, length=182.06, cosine=0.0008, train loss/acc=0.296/0.957/0.855/0.672/0.442, lr=2.1e-05
[180522 05:44:52] On #71000, 5.04 (iter/s), grad(35/37)=78.7/48.23, dScale=0, param size=125.637, length=199.971, cosine=0.0008, train loss/acc=0.3038/0.954/0.851/0.666/0.431, lr=2.1e-05
[180522 05:48:10] On #72000, 5.04 (iter/s), grad(35/37)=81.19/52.77, dScale=0, param size=125.636, length=187.55, cosine=0.0008, train loss/acc=0.3278/0.952/0.851/0.668/0.44, lr=2.1e-05
[180522 05:51:28] On #73000, 5.04 (iter/s), grad(35/37)=92.3/57.81, dScale=0, param size=125.635, length=186.046, cosine=0.0007, train loss/acc=0.3118/0.952/0.851/0.663/0.434, lr=2.1e-05
[180522 05:54:46] On #74000, 5.04 (iter/s), grad(35/37)=90.86/58.24, dScale=0, param size=125.634, length=195.341, cosine=0.0007, train loss/acc=0.3141/0.953/0.844/0.667/0.439, lr=2.1e-05
[180522 05:59:05] On #75000, 3.87 (iter/s), grad(35/37)=74.81/46.29, dScale=0, param size=125.632, length=202.407, cosine=0.0007, train loss/acc=0.3057/0.954/0.851/0.661/0.44, lr=2.1e-05
[180522 06:04:03] On #76000, 3.34 (iter/s), grad(35/37)=68.14/43.71, dScale=0, param size=125.631, length=189.243, cosine=0.0007, train loss/acc=0.2879/0.952/0.855/0.665/0.437, lr=2.1e-05
[180522 06:09:02] On #77000, 3.34 (iter/s), grad(35/37)=86.72/53.36, dScale=0, param size=125.63, length=197.349, cosine=0.0007, train loss/acc=0.3149/0.955/0.853/0.669/0.444, lr=2.1e-05
[180522 06:14:01] On #78000, 3.34 (iter/s), grad(35/37)=88.79/55.11, dScale=0, param size=125.629, length=185.043, cosine=0.0007, train loss/acc=0.3473/0.951/0.849/0.664/0.436, lr=2.1e-05
[180522 06:18:59] On #79000, 3.34 (iter/s), grad(35/37)=88.85/54.61, dScale=0, param size=125.628, length=194.873, cosine=0.0007, train loss/acc=0.3121/0.952/0.853/0.671/0.436, lr=2.1e-05
[180522 06:23:58] On #80000, 3.35 (iter/s), grad(35/37)=86.09/53.88, dScale=0, param size=125.626, length=202.949, cosine=0.0008, train loss/acc=0.3306/0.954/0.847/0.663/0.432, lr=2.1e-05
The question we need an answer to is: Does the performance loss correlate with the link width? If the answer is yes, you need to figure out why the link widths are different (BIOS setup, mechanical or electrical issues with the slot, etc). If the answer is no, the link width is a red herring and the root cause is elsewhere.
I am not familiar with your application and the benchmark, so I can’t easily come up with other hypotheses to explore. Is this a system with dual CPUs by any chance? If so, make sure to use numactl or a similar tool to bind CPU and memory, such that each GPU talks to the “close” CPU.
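For the dual-CPU case, a binding launcher could look roughly like the following sketch (Python; the node-to-GPU mapping, the train.py script, and the --gpu flag are all hypothetical placeholders for whatever your setup actually uses):

# Sketch (hypothetical dual-socket case): pin each training process and its memory
# to the NUMA node closest to the GPU it drives.
import subprocess

# Hypothetical mapping: GPU 0 sits closest to NUMA node 0, GPU 1 to node 1.
procs = []
for gpu, node in [(0, 0), (1, 1)]:
    procs.append(subprocess.Popen(
        ["numactl", f"--cpunodebind={node}", f"--membind={node}",
         "python3", "train.py", f"--gpu={gpu}"]))   # train.py / --gpu are placeholders
for p in procs:
    p.wait()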
I have only one CPU in the system.
I already checked that swapping the cards changes nothing in their behaviour - the problem follows the GPU UUID, not the PCIe slot.
If you physically swapped the cards and the “slowness” follows the card, the issue is with the card. When you monitor the cards with nvidia-smi during the benchmark run, do you notice anything change at the point where performance drops?
This could include the PCIe link width, because as far as I know the operational mode is negotiated between the GPU and the motherboard (within the maximum capabilities reported by nvidia-smi). If you examine the slot connector on the GPU carefully, is there any evidence that it is worn, dirty, or oily? I hypothesize that the signal integrity may be marginal, causing the PCIe link width to be reduced after the app has been running for some time.
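To catch the moment of the drop, I would leave a small logger running next to the training job and check its last lines once the iter/s numbers fall. Here is a sketch along those lines (same pynvml assumption as before; the 10-second interval and the gpu_watch.log file name are arbitrary choices) that appends a line whenever the SM clock, the PCIe link width, or the throttle reasons change:

# Sketch: poll all GPUs and log whenever SM clock, PCIe width or throttle reasons change
# (pynvml assumed installed; interval and file name are arbitrary).
import time
import pynvml

INTERVAL_S = 10
LOGFILE = "gpu_watch.log"   # hypothetical output file

def snapshot(h):
    return (
        pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM),          # MHz
        pynvml.nvmlDeviceGetCurrPcieLinkWidth(h),
        pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h),             # bitmask
        pynvml.nvmlDeviceGetPowerUsage(h) // 1000,                       # W
        pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU), # C
    )

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
last = [None] * len(handles)
try:
    with open(LOGFILE, "a") as log:
        while True:
            for i, h in enumerate(handles):
                s = snapshot(h)
                # Only log when clock, link width or throttle reasons change,
                # so the file stays small over a multi-day run.
                if last[i] is None or s[:3] != last[i][:3]:
                    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                    log.write(f"{stamp} GPU{i} sm={s[0]}MHz x{s[1]} "
                              f"reasons=0x{s[2]:x} {s[3]}W {s[4]}C\n")
                    log.flush()
                    last[i] = s
            time.sleep(INTERVAL_S)
finally:
    pynvml.nvmlShutdown()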
Electronics can be damaged by static discharge during handling of the hardware, which is why a conductive wrist strap is recommended when performing such work. It is possible that your slow GPU has suffered some sort of damage, but such things are pretty much impossible to diagnose remotely.
In your experiments, how much time typically elapses between the start of the application and the point where the GPU slows down?
By the way, what is the power rating of your power supply (PSU)? Are you using any 6-pin to 8-pin converters in the power cables going to the GPU? Any Y-splitters?
It is quite difficult to catch that moment. I checked the kernel log and dmesg - both are clean at that point.
What do you mean by power rating?
I use the 8-pin connectors that came with the PSU, no additional converters of any kind.
Usually it takes from a few hours at 100% load to a few days before the drop occurs.
PSUs have a wattage printed on them: a 600W PSU, a 1000W PSU, a 1600W PSU. This “size” should be prominently printed on the box the PSU came in, and it should also be printed on the label glued to the PSU.
Yours should be rated for 1000W if you want rock-solid operation. This is based on a rule of thumb that says the rated wattage of all system components should not exceed 60% of the PSU power rating. You have 250W x2 from the GPUs + maybe 120W for a single-CPU motherboard plus peripherals (it could be a bit more if you have tons of DRAM or a really high-end CPU). So 620W total for the system → 1000W PSU. The 60% rule provides significant headroom to absorb short-term spikes in power draw from either CPU or GPU.
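Spelled out, the arithmetic of that rule of thumb looks roughly like this (the 120 W platform figure is the same estimate as above):

# Rough PSU sizing per the 60% rule of thumb (component figures are estimates).
gpu_tdp_w  = 250          # per GTX 1080 Ti board power
num_gpus   = 2
platform_w = 120          # CPU + motherboard + drives, assumed
total_w    = gpu_tdp_w * num_gpus + platform_w   # 620 W of rated components
recommended = total_w / 0.6                      # ~1030 W, i.e. a PSU in the 1000 W class
print(total_w, round(recommended))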
I would recommend a PSU that complies with the 80 PLUS Platinum specs, with 80 PLUS Gold as the minimum standard for a workstation that runs under full load for days at a time. If you run that machine continuously, you should make back the additional expense of the Platinum PSU vs. Gold within a couple of years through electrical power savings (this will depend on PSU prices and the cost of electricity in your location).
Does the system operate in an environment where there could be significant vibrations, for example a ship or a factory floor? That could impact the reliability of the PCIe slot connectors.
Hours to days to “failure” would seem to exclude a thermal issue, as that should manifest within 30 minutes at most. So I suspect something marginal electrically: power-delivery reliability, signal-integrity issues, or damage to chips.
Oh wait, you mean power :^)
My PSU is rated at 1.5 kW, which seems quite enough for two 1080 Tis.
Yes, a 1500 W PSU should be more than enough, so the power supply should not be an issue.
I wonder whether electromagnetic interference could be an issue, with one GPU more susceptible than the other. Are there any large electromagnetic devices nearby (large electrical motors, MRI machines, …)? I assume the system is running in a proper enclosure consisting (mostly) of sheet metal?
[Sorry, need to stop here, falling asleep at the keyboard]