NVIDIA downclocks my card when running OpenCL

Hi.

Since NVIDIA doesn’t even acknowledge OpenCL enough to give it a forum section, I had to write here. I have PowerMizer set to maximum performance and the card runs at performance level 3. When I start my OpenCL program the driver very conveniently downclocks the card to performance level 2. So I’m just wondering - how do I run my OpenCL program without having NVIDIA cripple its performance? The GPU utilization is 99%, so it should automatically select maximum performance even in adaptive PowerMizer mode.

OpenGL runs at performance level 3, even on adaptive. Only OpenCL is crippled. What gives?

Wow, just wow.

If I run adaptive mode, fire up glxgears, the performance level goes to 3. Now I start my OpenCL program and it downclocks to performance level 2. When I then quit the OpenCL app, the performance level goes up to 3 again.

One would think that if glxgears alone is enough to trigger performance level 3, then glxgears + even more work would definitely trigger level 3.

Great job NVIDIA, you have successfully brainwashed the entire industry into using CUDA. Single vendor lockdown with tampered performance for any competitors. Scientology level brainwashing.

Actually, similar behavior (I think) occurs with CUDA codes, as discussed here:

[url]https://devtalk.nvidia.com/default/topic/892842/cuda-programming-and-performance/one-weird-trick-to-get-a-maxwell-v2-gpu-to-reach-its-max-memory-clock-/[/url]

Can you post the output of nvidia-smi while running the application, to show the effect on clock speed? As txbob says, some top-end clocks are reserved for graphics, and compute runs at a lower clock unless you manually boost the clock using nvidia-smi.
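Something along the following lines should show the effect; treat it as a rough sketch, since the exact fields available depend on your nvidia-smi/driver version:

[code]
# Run the OpenCL program in another terminal, then log P-state, clocks
# and utilization once per second:
nvidia-smi --query-gpu=timestamp,pstate,clocks.sm,clocks.mem,utilization.gpu --format=csv -l 1

# Or dump the full clock section plus any reported throttle reasons:
nvidia-smi -q -d CLOCK,PERFORMANCE
[/code]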

I agree with txbob that this observation appears to be similar or even identical to the power-management idiosyncrasies observed with CUDA last week. Without much additional information, I would not ascribe any sinister motives to this. The observed behavior may simply be a bug in the power-management software, or the slightly lower clocks for compute applications may be enforced on purpose to ensure functional correctness for compute applications.

Compute applications have a different GPU usage profile compared to graphics applications, and may exercise different speed paths. Furthermore, the error tolerance in compute applications is much lower than in graphics applications. A flipped bit in an hour of scientific simulation may well be fatal to the final results, while a flipped bit in an hour of game play usually means a transient error of one incorrectly colored pixel out of a million pixels, displayed for 1/60 of a second.

Thanks txbob for the link - it performs according to expectations now.
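For anyone finding this thread later: the workaround discussed in that link amounts to pinning the application clocks with nvidia-smi. Roughly like this (needs root, and whether -ac is allowed at all depends on the card and driver):

[code]
# Show which memory/graphics clock pairs the card supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Pin the application clocks to one of the supported pairs
# (the numbers below are only an example; pick a pair from the list above)
sudo nvidia-smi -ac 3505,1392

# Restore the default application clocks when done
sudo nvidia-smi -rac
[/code]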

@njuffa: You cannot assume “graphics” to be less important than “computations”. You can do important stuff in OpenGL compute shaders and you can do nonsense throw-away stuff in CUDA. The point is, a faster clock speed that is supported by the card should be used at higher load. Correctness has nothing to do with the API, a supported clock speed should just work - with any API.

Thanks for the link and the info about CUDA.

I did not state or imply that “graphics” is less important than “computations”. In fact, from a financial perspective, if you look at NVIDIA’s quarterly revenue breakdown, you will find that “graphics” is more important than any other business area of the company, about an order of magnitude larger than “compute”.

I was merely stating that different use cases and usage profiles (which may be correlated with APIs as an initial proxy) stress different parts of a GPU differently, and thus may lead different usage profiles to qualify for different maximum clocks. This is speculation of course, but speculation informed by working on building processors earlier in my career. I also stated that the observed behavior could simply be due to a bug in the software. There is no way for us to know at this point whether this behavior is intentional or unintentional, and, if it is intentional, what motivated it.

The use of incendiary language (“cripple”, “brainwash”) to start off a thread is usually not conducive to a helpful conversation on any topic.

You: “Furthermore, the error tolerance in compute applications is very much lower than in graphics applications.”

Me: “You cannot assume “graphics” to be less important than “computations”.”

You: “I did not state or imply that “graphics” is less important than “computations”.”

Right… I’ll just go ahead and remove my account now.

Btw, yes, this is a bug. The driver shouldn’t downclock. The card is tested on, and sold as capable of, running at a specific max frequency. The driver shouldn’t interfere with this. And no, the card doesn’t know whether it is running “graphics” or “computations” (hint: they are both computations and they stress the hardware the same way).

In that case, I would suggest filing a bug report. The bug reporting form is linked directly from the CUDA registered developer website.

There is zero reason why graphics and compute applications should stress the GPU in the same way. There are huge swaths of a GPU that are only used for compute and not graphics, and vice versa. For example, the rasterizer in graphics, and FP64 in compute, etc.

This is no bug. I see it stated nowhere that any particular clock rate is guaranteed for a given application.

This behaviour is consistent with Intel’s practices, where the clock rate when running flat-out vectorized AVX code is different from that of non-AVX code.

I would think that shared memory is probably another unit not utilized by graphics. I haven’t used OpenGL in a decade, though, so things may have changed.

Good point. However, *ntel’s dishonesty does not warrant similar behavior from NVIDIA, or any other competitor for that matter!

Many would agree that *ntel’s habit of burying the AVX clock behavior deep in technical docs and omitting it from most (all?) marketing material and commonly available specs is essentially equivalent to making a false claim, a dishonest and misleading practice.
Somewhat similarly, NVIDIA could and IMO should be more open about the clocking behavior of their devices, about their capabilities and configurability (or the lack of it). AFAIK there is very little documentation about these aspects, let alone management tools to change the behavior (and some of the behavior that can be modified is exposed only through the X server settings GUI).
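For what it’s worth, at least the PowerMizer preference can also be set from the command line rather than only from the GUI, assuming the usual nvidia-settings attribute name (a sketch, not gospel):

[code]
# 0 = adaptive, 1 = prefer maximum performance (GPU index 0)
nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
[/code]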

Note that I do agree that OpenGL and OpenCL/compute work does not stress the GPU in the same way. In my experience compute load often leads to lower utilization and/or power consumption than graphics. This suggests that, if anything, boost should be more aggressive when running compute, shouldn’t it? Admittedly, I do not have a huge number of data points, though.

Without knowing the details, it is hard to predict how power/frequency scaling will affect different units. Even if chip-level utilization is lower for compute, for example, there may be a single unit that is used only by compute and has a longer critical path, preventing more aggressive frequency scaling.

Processor designers have been wary of releasing fine-grained frequency and power scaling techniques because of these issues: some chips will run faster than others, and there has been a lot of worry that customers would complain about fairness, even though on average all chips would run faster than if they didn’t use these techniques.

I have always wondered why GPU manufacturers do not use speed binning, which CPU manufacturers have been using extensively for decades: otherwise identical parts can have a max frequency spread of up to 1.5x due to manufacturing variations.

Obviously the testing needed for speed binning would increase production costs, but it would allow the silicon vendor to extract a sizable premium for the fast bins, which should more than make up for that.

The model in the GPU world seems to be that card vendors do unofficial binning by offering plain, superclocked, and super-superclocked variants. However, I am wary of the process used for this informal binning; I would assume that the GPU vendors, with knowledge of the chip internals, have a better chance of accurately binning parts. I am aware that in the CPU world there have been instances of incorrectly binned parts early in the production of new CPUs due to a missed speed path, so even chip-vendor-based binning is not necessarily perfect.

I got burned by this a few years ago!

I bought a flagship super-uber-clocked GTX ███ from board partner ████ and it would almost always run one of the standard gaming benchmarks without displaying any artifacts.

Yet one of my CUDA benchmarks failed with errors at almost every clock speed.

Even worse, I returned it and the refurb the partner sent me was also flaky! It was really distressing. Was my code buggy? No, no, not possible! =)

The unnamed vendor became quiet but eventually accepted their junky refurb and reimbursed me.

I went to Newegg and bought a rather-average-clocked model… and it not only worked perfectly but it could be OC’d like crazy.

You can imagine my relief.

\o/

I’m fine with variations. In my view that’s an expected result of chip manufacturing technology. Many people may not think the same way, though.

What I find strange is that in CPUs frequency scaling seems to have been figured out quite well: it’s fine-grained, quick, quite uniform across samples of the same chip model, and all in all it seems to cause little trouble (at least on the desktop). In contrast, I get the impression that GPUs are somewhat behind. Application clocks appeared late, sometime around GK110 (and I don’t think the first drivers even supported them), and they were/are all manual. At the same time, the recently introduced “auto-boost” seems to be only a little bit more refined; e.g. it’s quite slow in contrast to the frequency scaling I have observed on CPUs.
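(For completeness: on GPUs and drivers that support it, the default auto-boost policy can at least be toggled via nvidia-smi, though that is still a manual, coarse knob; the exact option spelling may vary by driver version.)

[code]
# Disable (0) or enable (1) the default auto-boost behaviour on GPU 0; needs root
sudo nvidia-smi -i 0 --auto-boost-default=0
[/code]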

Can somebody shed some light on the reasons? Is it simply because the code that runs on CPUs rarely scales, as desktop applications often just do not have enough concurrency and devs have been reluctant (and often incompetent) to write parallel code, so CPU vendors had to step in and develop a “fix” for this?