Very fast ramp-down from high to low clock speeds leading to increased time repeatedly ramping up

re: [Very(!) slow ramp down from high to low clock speeds leading to a significantly increased power cons - #144 by amrits]

I have a related but opposite issue on Windows 10, using driver version 460.89 with CUDA 11.2, in TCC mode.

I want the clock to stay fast for 30 seconds or so before ramping down. Is there a way to control this?

If not, what would be ideal would be some way to control the GPU clock ramp-down rate, perhaps an environment variable like CUDA_CLOCK_RAMPDOWN_SECS that defaults to the current behavior but lets users control/tune it per application.

To the best of my knowledge, the automatic GPU power management has no user-settable knobs. Depending on your use case and your GPU, you may want to explore whether fixing application clocks through nvidia-smi allows you to approximate what you want. Relevant command line switches are --applications-clocks and --lock-gpu-clocks.


Those look like good suggestions. Digging into nvidia-smi more, I see those aren’t supported on my (Pascal) Quadro P1000, alas. But maybe I can keep the clock rate high another way?

I’m using CUDA in my C# Emgu (OpenCV) application to run a CaffeNet neural net that processes images. I’m noticing that I only get the expected frame rate / throughput if I keep it warmed up. If I run it (the usual neural-net forward operation) twice in succession, the second run gives me the expected timings (~100 ms/frame), but the first run is slower (~500 ms/frame). (This does not include the time spent loading the model, which is done separately.) If this were an application that continuously processed frames, this would not be an issue. However, in my use case I run it periodically (once every 30-120 seconds or so), but I still want it to be fast, even for single-frame processing.

Maybe a warm-up kernel would help, but I’m somewhat at the mercy of the APIs I’m using, which don’t expose the underlying CUDA context that would be needed for this.
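For reference, this is roughly what I mean by a warm-up kernel in plain CUDA C++; it is only a sketch of the idea, not something Emgu exposes, and the names (warmup, warm_up_gpu) are made up:

#include <cuda_runtime.h>

// Hypothetical no-op kernel: launching it once (and synchronizing) forces
// context creation and nudges the GPU clocks up before the real work arrives.
__global__ void warmup() {}

void warm_up_gpu()
{
    warmup<<<1, 1>>>();        // trivial launch; cost is dominated by one-time setup
    cudaDeviceSynchronize();   // wait so the warm-up completes before timing starts
}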

What exactly happened when you tried to set the application clocks?

I just tried --applications-clocks on a Quadro P2000 under Windows 10 with the normal WDDM driver, and nvidia-smi had no complaints when I set the application clocks. Note that nvidia-smi requires administrative privileges to set clocks, so you may need to run nvidia-smi -acp UNRESTRICTED from an administrator command prompt first. Make sure to specify valid clock values, a list of which you can retrieve with nvidia-smi -q -d SUPPORTED_CLOCKS. For example, the Quadro P2000 seems to support only one memory clock value: 3504 MHz.

When I use nvidia-smi -i 0 -ac 3504,1721 the response from nvidia-smi is Applications clocks set to "(MEM 3504, SM 1721)" for GPU 00000000:17:00.0.

I can’t quite make sense of your observations. When I run with very short kernels, sufficiently spaced apart, my GPU seems to operate at about half of its highest available clock rate. Whether this is real or an artifact of how the GPU-Z utility samples the clocks I cannot say. But even if the clocks never ramp up to their full frequency on account of the short-running kernel, I would not expect this to cut performance by a factor of 5, as you state you observe.


This man page for nvidia-smi (nvidia-smi.txt) led me to believe that setting application clocks was only supported for “Tesla devices from the Kepler+ family and Maxwell-based GeForce Titan.”

I just tried nvidia-smi -ac 2505,1544, and I get no complaints from nvidia-smi. However, I see no change when watching nvidia-smi dmon: the speeds are as low as before. As before, the clocks do ramp up when I run the net (then ramp down a few seconds later), but I see no difference in my application timings.

What is odd to me is that, as a test, I am able to get the high speeds if I run the model (the neural-net forward operation) twice on a test image: the second and subsequent runs are the fast ones; the first one is slow.

It may be that my test isn’t valid, in that I am using the same test image multiple times. The next net-forward operation (not a test) uses a different input image, and that may be confounding my apparent timings.

Did you try it in conjunction with -lgc (--lock-gpu-clocks)? From my observations, it definitely seems that application clocks are not maintained if there is throttling, e.g. thermal or power based, which makes sense to me. If that applies, you may want to experiment with lower application clocks.

Since you are on Windows, I would definitely recommend TechPowerUp’s GPU-Z utility for continuous and convenient graphical monitoring. It allows sample intervals as short as 0.1 seconds, though I would not recommend that for regular monitoring, as it probably creates high overhead.

The behavior of your app is not clear to me, nor is the performance measurement methodology. When software interacts with hardware, there is usually a startup overhead on the first run, due to numerous initialization and caching effects, from disk buffers all the way up to processor caches. CUDA-accelerated applications additionally incur a delay at the start for CUDA context creation before the first kernel runs; this overhead does not apply to subsequent kernels. The context-creation delay grows with the total amount of memory in the system (host plus device) and with lower single-thread CPU performance.
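As a rough illustration in plain CUDA C++ (not Emgu), one way to see this one-time cost is to time the very first kernel launch against a later one; the trivial noop kernel and the std::chrono timing here are just placeholders:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void noop() {}   // trivial placeholder kernel

static double elapsed_ms(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    // First launch pays for context creation and other one-time setup.
    auto t0 = std::chrono::steady_clock::now();
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
    printf("first launch : %.2f ms\n", elapsed_ms(t0));

    // Second launch reuses the existing context and is far cheaper.
    t0 = std::chrono::steady_clock::now();
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
    printf("second launch: %.2f ms\n", elapsed_ms(t0));
    return 0;
}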


Unfortunately, --lock-gpu-clocks isn’t supported for this card:

> nvidia-smi --lock-gpu-clocks=2505,2505
Setting locked GPU clocks is not supported for GPU 00000000:03:00.0.
Treating as warning and moving on.
All done.

My performance measurement brackets the neural-net operations I want to measure (SetInput, Forward) with a C# stopwatch start/stop to get the elapsed time. When I do this, I get the expected, faster timings only if I repeat that test and look at the second and subsequent timings.

I’ll have a look at TechPowerUp’s GPU-Z and see what is happening as I run my CUDA application.

Thanks so much for the help!

Yeah, I get that same status message when I try to lock the GPU clock on my Quadro P2000.

When you refer to the first and second run, are you referring to two separate runs of the C# app (i.e. launching the app from the command line twice), or to two invocations of a CUDA-accelerated function within the same run of the C# app? If it is the latter, my comments regarding CUDA startup overhead likely apply, meaning the observed performance differences may not actually be related to GPU clocks. I have no insight into C# bindings for CUDA or how the C# stopwatch works, though.


It must be something besides GPU clock speeds per se.

When I look at GPU-Z running at 0.1-second sample intervals, I can see the clocks stay at top speed for a few seconds. If I run my test again while GPU-Z shows the clocks running at top speed, I still get the same timings. So that now leads me to believe that the GPU clock speed is fine, and something else I am doing is causing what I am seeing.

By “second run” here, I mean invoking the same test function multiple times, all within the same run of the C# app.

When doing timing from within CUDA apps, a common trick is to issue a call to cudaFree(0) prior to the timed portion of the application. The call triggers CUDA context creation and the associated overhead, which therefore does not impact the subsequent kernel launches in the timed portion.
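A minimal sketch of the idea in plain CUDA C++ (the work kernel and the host-side timing are placeholders standing in for the real workload):

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void work() {}   // stand-in for the real workload

int main()
{
    // Warm-up call: triggers context creation so that cost stays outside the timed region.
    cudaFree(0);

    auto t0 = std::chrono::steady_clock::now();
    work<<<1, 1>>>();                  // the portion we actually want to time
    cudaDeviceSynchronize();           // ensure the GPU work is finished before stopping the clock
    double ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
    printf("timed portion: %.2f ms\n", ms);
    return 0;
}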

As I said, I don’t know anything about C# bindings for CUDA, so I cannot advise on how to replicate this technique in a C# environment.


The cudaFree(0) idea looks good. I was thinking along those lines (see “Disappointing performance with yolo, what is wrong?” on the Emgu CV: OpenCV in .NET forum), and I am hoping I can find a way to invoke cudaFree from Emgu.
Thanks for your help!

To maintain the clocks when the GPU is idle, you would need persistence mode enabled and the application clocks set to an appropriate level.

That should generally work on datacenter GPUs. I don’t know if any of it is supported on your GPU.

That won’t necessarily force the GPU to run at elevated clocks when it is idle, but it should cause the GPU to reach those clocks immediately as soon as a CUDA context is established on the GPU, and maintain those clocks for the duration of the CUDA context, unless power or thermal issues (discoverable with nvidia-smi) cause a change in behavior.

I thought the driver is always persistent under Windows (which is what the OP uses)? Or is there a difference when the TCC driver is used?

Sorry, you are correct. Even in TCC mode, persistence mode is not configurable on Windows.