Very fast ramp-down from high to low clock speeds leading to increased time repeatedly ramping up

re: [Very(!) slow ramp down from high to low clock speeds leading to a significantly increased power cons - #144 by amrits]

I have a related but opposite issue on Windows 10, using driver version 460.89 with CUDA 11.2, in TCC mode.

I want the clock to stay fast for 30 seconds or so before ramping down. Is there a way to control this?

If not, what would be ideal would be some way to control the GPU clock ramp-down rate, perhaps an environment variable like CUDA_CLOCK_RAMPDOWN_SECS that defaults to the current behavior but lets users control/tune it per application.

To the best of my knowledge, the automatic GPU power management has no user-settable knobs. Depending on your use case and your GPU, you may want to explore whether fixing application clocks through nvidia-smi allows you to approximate what you want. Relevant command line switches are --applications-clocks and --lock-gpu-clocks.


Those look like good suggestions. Digging into nvidia-smi more, I see those aren’t supported on my (Pascal) Quadro P1000, alas. But maybe I can keep the clock rate high another way?

I’m using CUDA in my C# Emgu (OpenCV) application to run a CaffeNet neural net that processes images. I’m noticing that I only get the expected frame rate / throughput if I keep it warmed up. If I run it (the usual neural-net forward operation) twice in succession, the second run gives me the expected timings (~100 ms/frame), but the first run is slower (~500 ms/frame). (This does not include the time spent loading the model, which is done separately.) If this were an application that continuously processed frames, this would not be an issue. However, in my use case I run it periodically (once every 30-120 seconds or so), but I still want it to be fast, even for single-frame processing.

Maybe a warm-up kernel would help, but I’m somewhat at the mercy of the APIs I’m using, which don’t expose the underlying CUDA context that would be needed for this.
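For reference, this is roughly what I mean by a warm-up kernel in plain CUDA C++; it is only a sketch of the idea, not something Emgu exposes, and the names (warmup, warm_up_gpu) are made up:

#include <cuda_runtime.h>

// Hypothetical no-op kernel: launching it once (and synchronizing) forces
// context creation and nudges the GPU clocks up before the real work arrives.
__global__ void warmup() {}

void warm_up_gpu()
{
    warmup<<<1, 1>>>();        // trivial launch; cost is dominated by one-time setup
    cudaDeviceSynchronize();   // wait so the warm-up completes before timing starts
}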

What exactly happened when you tried to set the application clocks?

I just tried --applications-clocks on a Quadro P2000 under Windows 10 with the normal WDDM driver, and nvidia-smi had no complaints when I set the application clocks. Note that nvidia-smi requires administrative privileges to set clocks, so you may need to run nvidia-smi -acp UNRESTRICTED from an administrator command prompt first. Make sure to specify valid clock values, a list of which you can retrieve with nvidia-smi -q -d SUPPORTED_CLOCKS. For example, the Quadro P2000 seems to support only one memory clock value: 3504 MHz.

When I use nvidia-smi -i 0 -ac 3504,1721 the response from nvidia-smi is Applications clocks set to "(MEM 3504, SM 1721)" for GPU 00000000:17:00.0.

I can’t quite make sense of your observations. When I run with very short kernels, sufficiently spaced apart, my GPU seems to operate at about half of its highest available clock rate. Whether this is real or an artifact of how the GPU-Z utility samples the clocks I cannot say. But even if the clocks never ramp up to their full frequency on account of the short-running kernel, I would not expect this to cut performance by a factor of 5, as you state you observe.


This man page for nvidia-smi (nvidia-smi.txt) led me to believe that setting application clocks was only supported for “Tesla devices from the Kepler+ family and Maxwell-based GeForce Titan.”

I just tried nvidia-smi -ac 2505,1544, and I get no complaints from nvidia-smi. However, I see no change when watching nvidia-smi dmon: the speeds are as low as before. As before, the clocks do ramp up when I run the net (then ramp down a few seconds later), but I see no difference in my application timings.

What is odd to me is that, as a test, I am able to get the high speeds if I run the model (the neural-net forward operation) twice on a test image: the second and subsequent runs are the fast ones; the first one is slow.

It may be that my test isn’t valid, in that I am using the same test image multiple times. The next net-forward operation (not a test) uses a different input image, and that may be confounding my apparent timings.

Did you try it in conjunction with -lgc (--lock-gpu-clocks)? From my observations, it definitely seems that application clocks are not maintained if there is throttling, e.g. thermal or power based, which makes sense to me. If that applies, you may want to experiment with lower application clocks.

Since you are on Windows, I would definitely recommend TechPowerUp’s GPU-Z utility for continuous and convenient graphical monitoring. It allows sample intervals as short as 0.1 seconds, though I would not recommend that for regular monitoring, as it probably creates high overhead.

The behavior of your app is not clear to me, nor is the performance measurement methodology. When software interacts with hardware, there is usually a startup overhead on the first run, due to numerous initialization and caching effects, from disk buffers all the way up to processor caches. CUDA-accelerated applications additionally incur a delay at the start for CUDA context creation before the first kernel runs; this overhead does not apply to subsequent kernels. The context-creation delay grows with the total amount of memory in the system (host plus device) and with lower single-thread CPU performance.
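As a rough illustration in plain CUDA C++ (not Emgu), one way to see this one-time cost is to time the very first kernel launch against a later one; the trivial noop kernel and the std::chrono timing here are just placeholders:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void noop() {}   // trivial placeholder kernel

static double elapsed_ms(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    // First launch pays for context creation and other one-time setup.
    auto t0 = std::chrono::steady_clock::now();
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
    printf("first launch : %.2f ms\n", elapsed_ms(t0));

    // Second launch reuses the existing context and is far cheaper.
    t0 = std::chrono::steady_clock::now();
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
    printf("second launch: %.2f ms\n", elapsed_ms(t0));
    return 0;
}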


Unfortunately, --lock-gpu-clocks isn’t supported for this card:

> nvidia-smi --lock-gpu-clocks=2505,2505
Setting locked GPU clocks is not supported for GPU 00000000:03:00.0.
Treating as warning and moving on.
All done.

My performance measurement brackets the neural-net operations I want to measure (SetInput, Forward) with a C# stopwatch start/stop to get the elapsed time. When I do this, I get the expected, faster timings only if I repeat that test and look at the second and subsequent timings.

I’ll have a look at TechPowerUp’s GPU-Z and see what is happening as I run my CUDA application.

Thanks so much for the help!

Yeah, I get that same status message when I try to lock the GPU clock on my Quadro P2000.

When you refer to the first and second run, are you referring to two separate runs of the C# app (i.e. launching the app from the command line twice), or to two invocations of a CUDA-accelerated function within the same run of the C# app? If it is the latter, my comments regarding CUDA startup overhead likely apply, meaning the observed performance differences may not actually be related to GPU clocks. I have no insight into C# bindings for CUDA or how the C# stopwatch works, though.


It must be something besides GPU clock speeds per se.

When I look at GPU-Z running at 0.1-second sample intervals, I can see the clocks stay at top speed for a few seconds. If I run my test again while GPU-Z shows the clocks running at top speed, I still get the same timings. So that now leads me to believe that the GPU clock speed is fine, and something else I am doing is causing what I am seeing.

By “second run” here, I mean invoking the same test function multiple times, all within the same run of the C# app.

When doing timing from within CUDA apps, a common trick is to issue a call to cudaFree(0) prior to the timed portion of the application. The call triggers CUDA context creation and the associated overhead, which therefore does not impact the subsequent kernel launches in the timed portion.
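A minimal sketch of the idea in plain CUDA C++ (the work kernel and the host-side timing are placeholders standing in for the real workload):

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void work() {}   // stand-in for the real workload

int main()
{
    // Warm-up call: triggers context creation so that cost stays outside the timed region.
    cudaFree(0);

    auto t0 = std::chrono::steady_clock::now();
    work<<<1, 1>>>();                  // the portion we actually want to time
    cudaDeviceSynchronize();           // ensure the GPU work is finished before stopping the clock
    double ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
    printf("timed portion: %.2f ms\n", ms);
    return 0;
}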

As I said, I don’t know anything about C# bindings for CUDA, so I cannot advise on how to replicate this technique in a C# environment.


The cudaFree(0) idea looks good. I was thinking along those lines (see “Disappointing performance with yolo, what is wrong?” on the Emgu CV: OpenCV in .NET forum), and I am hoping I can find a way to invoke cudaFree from Emgu.
Thanks for your help!

To maintain the clocks when the GPU is idle, you would need persistence mode enabled and the application clocks set to an appropriate level.

That should generally work on datacenter GPUs. I don’t know if any of it is supported on your GPU.

That won’t necessarily force the GPU to run at elevated clocks when it is idle, but it should cause the GPU to reach those clocks immediately as soon as a CUDA context is established on the GPU, and maintain those clocks for the duration of the CUDA context, unless power or thermal issues (discoverable with nvidia-smi) cause a change in behavior.

I thought the driver is always persistent under Windows (which is what the OP uses)? Or is there a difference when the TCC driver is used?

Sorry, you are correct. Even in TCC mode, persistence mode is not configurable on Windows.