I am trying to do some performance analysis on a GPU (GTX 1070, on Ubuntu Linux) and I can see that initially, on a cold start, the runtimes are high, but after running tasks for some time their runtimes decrease and stabilize. I suspect this might be due to power management (clocking the GPU up/down) by the driver.
Is there a way to disable this? If not, then
What can be done to minimize its impact and get more reliable measurements for performance analysis?
Are there any other GPUs (other than the GTX 1070) that give the user better control over power management?
This sounds more like a case of automatic clock boosting while staying in the same power state than of transitions between different power states. A look at the nvidia-smi output could probably confirm that (it shows the power state).
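If you want to watch this while your benchmark runs, nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem --format=csv -l 1 prints the power state and clocks once per second. For logging it from inside your own harness, here is a minimal sketch using NVML (the library nvidia-smi is built on); device index 0 is assumed and error checking is omitted:

[code]
// Minimal sketch: poll the performance state and current clocks via NVML.
// Build with, e.g.:  g++ pstate_log.cpp -lnvidia-ml   (nvml.h ships with the CUDA toolkit)
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);           // GPU 0 assumed; adjust if you have several

    for (int i = 0; i < 30; ++i) {                 // one sample per second for 30 s
        nvmlPstates_t pstate;
        unsigned int smMHz = 0, memMHz = 0;
        nvmlDeviceGetPerformanceState(dev, &pstate);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smMHz);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memMHz);
        printf("P%d  SM %u MHz  MEM %u MHz\n", (int)pstate, smMHz, memMHz);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
[/code]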
The Tesla line of professional GPUs offers (or at least offered; I haven't used recent SKUs) a collection of application clocks that users can select from with nvidia-smi -ac. The idea behind this is that in a cluster of GPU-accelerated machines (the typical deployment environment of Tesla GPUs), it causes problems with work distribution etc. when different GPUs run at different auto-boost clocks due to variations in temperature and power usage.
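The same application-clock controls are also exposed programmatically through NVML, so a benchmark harness can query and pin them itself. A rough sketch, assuming a GPU that actually exposes application clocks (Tesla and a few others) and root privileges for the set call:

[code]
// Rough sketch: the NVML equivalent of nvidia-smi -q -d SUPPORTED_CLOCKS and nvidia-smi -ac.
// Build with, e.g.:  g++ appclocks.cpp -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);           // GPU 0 assumed

    unsigned int memCount = 32, memClocks[32];
    if (nvmlDeviceGetSupportedMemoryClocks(dev, &memCount, memClocks) != NVML_SUCCESS
        || memCount == 0) {
        printf("application clocks are not supported on this GPU\n");
        nvmlShutdown();
        return 1;
    }

    unsigned int gfxCount = 256, gfxClocks[256];
    nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[0], &gfxCount, gfxClocks);
    printf("%u supported memory clocks; %u graphics clocks for mem = %u MHz\n",
           memCount, gfxCount, memClocks[0]);

    // Pin one supported pair so every run uses the same clocks (needs root):
    nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, memClocks[0], gfxClocks[0]);
    if (r != NVML_SUCCESS)
        printf("could not set application clocks: %s\n", nvmlErrorString(r));

    nvmlShutdown();
    return 0;
}
[/code]

nvmlDeviceResetApplicationsClocks() (or nvidia-smi -rac) restores the default clocks afterwards.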
A reasonably practical alternative may be to exploit the effect you have observed (“their runtime decreases and stabilizes”) by warming up the GPU before you measure. This still leaves the problem of clock variations over longer periods of time, e.g. caused by different GPU temperatures due to differences in ambient temperature on different days, or at different times of the day.
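In practice that means: run a batch of untimed warm-up iterations first, then time many repetitions and report the median or minimum. A sketch using CUDA events; the kernel, problem size, and repetition counts are placeholders for whatever you actually measure:

[code]
// Warm-up-then-measure sketch. Build with, e.g.:  nvcc -O2 warmup_timing.cu -o warmup_timing
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *x, int n) {          // placeholder workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.000001f + 0.5f;
}

int main() {
    const int n = 1 << 24;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    // Warm-up: run untimed until the clocks have boosted and runtimes have stabilized.
    for (int i = 0; i < 100; ++i)
        kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Timed runs: the median (or minimum) of many repetitions is far less
    // sensitive to residual clock jitter than a single measurement.
    for (int rep = 0; rep < 10; ++rep) {
        cudaEventRecord(start);
        kernel<<<(n + 255) / 256, 256>>>(d_x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("rep %d: %.3f ms\n", rep, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
[/code]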
CPUs also have automatic clock boosting (mine, for example, boosts anywhere between 3.5 GHz and 3.9 GHz based on a number of parameters), with no way of directly controlling it as a user as far as I can tell, so this issue has a wider scope. I expect automated clock boosting to get ever more intricate, with wider ranges of possible clocks, as manufacturers try to squeeze maximum performance out of silicon after the death of Moore’s Law.
That clock speeds have stopped scaling is pretty clear. Actually, they stopped scaling much earlier than 2006:
2002 Pentium 4: 3 GHz
2018 Intel Core i7-8086K: 5 GHz… but only in Turbo mode on a single core; the base clock is 4 GHz.
But we’ve continued to downscale transistors (keeping power density on the chip area approximately constant) - which is what Moore’s law is all about. What you could argue is that transistor scaling has slowed down recently - and hence this marks the end of Moore’s law.
[url]https://www.fool.com/investing/2018/04/11/is-intel-corp-ceo-brian-krzanich-to-blame-for-its.aspx[/url]
In the years since Krzanich took the CEO role, however, the company’s manufacturing efforts have been, to put it mildly, poor. The ramp up of the company’s 14-nanometer manufacturing technology was both late and highly problematic from a technology and financial point of view, and the company’s follow-up 10-nanometer technology is still, as of this writing, missing in action despite being originally slated to go into mass production more than three years ago.
While some foundries are doing a bit better than Intel at the moment, as best I can tell Moore’s Law is pretty much dead. This is not due to technical issues alone, but to those in combination with financial factors (a green-field fab will run you $10B; what could one produce in it to recoup the investment and turn a profit?). From here on out, the performance game will consist mostly of refining microarchitectures and optimizing software, ideally in the form of cohesive hardware/software co-development.
Personally, I believe NVIDIA is very well positioned to excel at that game.
The GTX 970 was the best consumer-level card as far as nvidia-smi features go: it still had features similar to a Tesla compute card (app clocks, real P-state locking and controls, etc.) before nVidia caught on and hobbled the next generation to drive sales of compute-approved cards.
The 10xx series is lobotomized in the driver and will never run P0 during compute sessions. If you use the clock offsets to gain some clock control and then exit the compute session, the card will usually jump P2 -> P0 -> P8 instead of just P2 -> P8, in which case it can hang or crash the bus: P2 base + offset is sometimes much, much lower than P0 base + offset, so when the card momentarily visits P0 it painfully overclocks itself into crashing.
Run Windows and find nvidiaProfileInspector, which at least allows you to turn off this Force-P2 silliness (tweak the base global profile, apply, reboot; repeat every time you change driver versions too, since the setting gets reset). You still get no good features from nvidia-smi, but at least the clocking is more controllable.
The P2 lock was allegedly done so that results would be more reliable, but that doesn’t allow for the use case where speed is everything and corrupt results can easily be validated and tossed (the compute rate minus a few bad apples is still higher than the compute rate locked in P2). I also have some PNY 1060 cards where the P2 clocks equal the P0 clocks anyway; those work nicely in Linux, since it is effectively always P0 even if the driver asks for P2. These other single-fan MSI ones have seriously, stupidly slow P2 settings in the BIOS, so I must run them in Windows only or suffer a 20% performance hit, which is a big problem. I already tried applying the Windows driver nvreg key by hex handle into the Linux nvreg, but it didn’t have any effect (as advertised elsewhere, the feature isn’t in the Linux driver at all).
Not every compute app is computer vision or whatever, guys. At least implement the same setting in the Linux drivers… I’m kind of tired of running Windows just to get full speed out of these cards… and probably losing another 20% to rebooting and general Windows being Windows…
I suppose that’s the main reason for calc-fast-check-results-later, so ya got me (Ethereum cranker),
but I sweeeeaar there are other use cases for such! There must be!
Why not just tell the science types to underclock if they like accuracy, instead of putting the cards on crutches for everyone? Oh right, the marketing dept forced it. The actual death of Moore’s Law is caused by marketing departments.
When you don’t make artificial price points and just build the fastest widget you can possibly make, things improve at the natural rate (approaching Moore’s Law). However, profit extraction requires:
slowly… stepping… through… product levels… and feature-set… combos… and making fake feature sets by software strapping (remember the AMD CPUs with a whole core available to hack-unlock, which worked fine? Or the AMD Hawaii GPU with 4 disabled shader units that could be flipped back on via flash? Neither of these was a QA-binning thing as much as they’d like you to think).
I think Intel just makes the 5 GHz core and then sets the clocking for whatever market niches the marketing dept says are packed with rubes and loose wallets.
But I love nVidia mainly by proxy, as I used to love 3dfx the most, but you guys ate them.