Stability Issues with GPU Inference on Older GPUs (e.g., 1080Ti)

Description

Hello
I’m facing an issue with older GPUs, such as the 1080Ti, where I am unable to use ‘nvidia-smi -lmc -lgc’ to lock the GPU and memory clocks because the card itself does not support it. This results in significant fluctuations in the inference time of my model, which runs in a loop. Each automatic downclocking event introduces a substantial delay.

I’ve attempted to address this by writing a simple kernel that is launched every 10 microseconds (it is no longer effective once the interval exceeds about 0.1 ms), which effectively prevents the automatic downclocking. However, this approach comes at the cost of increased SM utilization, leading to longer model inference times. Moreover, it seems unconventional.
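Roughly, the idea looks like the sketch below (simplified; the empty kernel and the host-side busy-wait are just illustrative, not my exact production code):

```
// Simplified sketch of the keep-alive idea: an empty kernel is launched from a
// dedicated host thread roughly every 10 microseconds so the GPU never idles
// long enough to trigger automatic downclocking. A busy-wait is used because
// OS sleep granularity is far coarser than 10 microseconds.
#include <atomic>
#include <chrono>
#include <cuda_runtime.h>

__global__ void keep_alive() {}  // does no work; only keeps the GPU active

void keep_alive_loop(std::atomic<bool>& stop)
{
    using clock = std::chrono::steady_clock;
    const auto interval = std::chrono::microseconds(10);  // tunable threshold
    while (!stop.load(std::memory_order_relaxed)) {
        keep_alive<<<1, 1>>>();
        const auto next = clock::now() + interval;
        while (clock::now() < next) { /* spin */ }
    }
    cudaDeviceSynchronize();  // drain outstanding launches before returning
}
```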

I’m seeking advice on how to achieve stable model inference on older GPUs, such as locking clock frequencies. Any suggestions or alternative methods would be greatly appreciated.

Thank you!

Environment

GPU Type: 1080Ti
Nvidia Driver Version: 537.13
CUDA Version: 12.2
Operating System + Version: Windows 10

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Does this GPU support application clocks (-ac)? If so, you might want to try that.
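If you want to check programmatically rather than with nvidia-smi, a rough, untested sketch using NVML (the C API behind nvidia-smi) is below. Whether `nvmlDeviceSetApplicationsClocks` succeeds on this GeForce card is exactly the open question, so treat it as a probe, not a guaranteed solution:

```
// Untested sketch: probe application-clock support via NVML. A return of
// NVML_ERROR_NOT_SUPPORTED means the device does not expose this feature.
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) { std::puts("NVML init failed"); return 1; }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) { nvmlShutdown(); return 1; }

    // Query supported memory clocks; "not supported" here already means the
    // card cannot take application clocks at all.
    unsigned int memCount = 16, memClocks[16];
    nvmlReturn_t r = nvmlDeviceGetSupportedMemoryClocks(dev, &memCount, memClocks);
    std::printf("GetSupportedMemoryClocks: %s\n", nvmlErrorString(r));

    if (r == NVML_SUCCESS && memCount > 0) {
        unsigned int gfxCount = 64, gfxClocks[64];
        r = nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[0], &gfxCount, gfxClocks);
        if (r == NVML_SUCCESS && gfxCount > 0) {
            // Equivalent to `nvidia-smi -ac <mem>,<gfx>` with the first reported pair.
            r = nvmlDeviceSetApplicationsClocks(dev, memClocks[0], gfxClocks[0]);
            std::printf("SetApplicationsClocks: %s\n", nvmlErrorString(r));
        }
    }
    nvmlShutdown();
    return 0;
}
```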

That does not sound right. I could not tell you how fast the power management software reacts to load changes, but it is probably more on the order of a millisecond, so you would need to fire off a kernel at most once every millisecond. The shortest kernel launch time I have seen was 3 microseconds, so issuing a kernel every 10 nanoseconds seems not possible.

Well, your goal is unconventional by trying to achieve functionality that is simply not supported on this consumer card, as you established.

I assume you have verified that the reason for down-clocking the GPU is that the GPU idles, rather than a thermal issue? I would expect a GPU-accelerated app to show symptoms of heavy GPU load, rather than exhibit issues with light GPU load.

Does the application have configuration switches you can use to issue work to the GPU in rapid-fire manner, rather than with substantial gaps in between GPU work? Can you move the GPU to a faster host system? I usually recommend CPUs with a base clock of >= 3.5 GHz to reduce the likelihood of application performance being limited by host performance.

Nsight Compute allows clock locking on these cards, but the last version to support the 1080 is 2019.5, which ships with CUDA Toolkit 10.2 or may be separately downloaded here.

Note that it locks the clock to the base clock frequency.

That said, I’m not quite sure whether you are just wanting to measure the kernel duration or if you need to lock the clock for normal operation.

This was my mistake; it should be 10 microseconds rather than 10 nanoseconds. When the interval exceeds 100 microseconds, the GPU frequency will often decrease automatically, and when it is below 10 microseconds, the SMs are continuously occupied by the kernel, which makes it difficult for other programs to execute normally (as far as I know, SM utilization refers to the occupancy of time slices, correct?).

This does not seem to be a thermal issue. I have monitored the temperature using nvidia-smi, and it remains within the range of 40 to 65 °C. Moreover, when I do not start the simple keep-alive kernel, there are fluctuations in the timing right from the beginning.

However, after starting a loop that launches the kernel every 10 microseconds, the inference becomes relatively stable, with only occasional fluctuations that do not significantly impact performance.

The above experiment was on a 3060, with the aim of exploring alternatives to clock locking, and the results are consistent with those on the 1080Ti.

Additionally, I have just made an interesting discovery: with the 472 NVIDIA driver and TensorRT 7, no latency fluctuations are observed on the 1080Ti.

Thank you very much for your advice.

My intention is to lock the clock in order to keep my CUDA and TensorRT code in a consistently fast inference state, avoiding drastic increases in execution time due to automatic underclocking.

I will try the 2019.5 version you mentioned later. However, I am uncertain if it is compatible with the newer versions of CUDA 12.x and TensorRT 8.6+.

The issue is clearer to me now with the nvidia-smi output. I’m not actually sure whether, even if you could lock the clocks, that would be honoured in a situation where the GPU downclocks in a low-usage scenario - the power-saving feature is quite aggressive.

Certainly it’s possible that there may be issues with at least the newer drivers - see here.


Thank you for your prompt response.

I am currently testing the clock-locking feature of 2019.5, as well as its impact on my CUDA code.

However, are there any other methods to achieve a similar effect, ensuring stable and fast execution of CUDA/TensorRT programs without requiring major CUDA API changes?

Is there a standalone tool available for clock locking that does not require changing the CUDA version?
Regarding my looped-kernel approach, is there a more efficient solution that can keep the GPU active continuously while minimizing the impact on other CUDA programs?

In summary, my main goal is to ensure that the CUDA program runs stably. My initial findings suggest that this instability is due to GPU underclocking.
The test results indicate that clock locking completely resolves the issue, and that running a kernel in a 10-microsecond loop can mitigate the underclocking.

Looking forward to your response.

I’ve no experience with CUDA programs where the GPU is underutilised to this degree.
While it seems GPUs later than the 10XX series have clock-locking capabilities, you don’t seem to have that option.

This issue has come up from time to time - here for example, but like you have found, it appears that a minimum level of “busy-ness” is required.

Thank you for your response.
After reviewing the provided materials, I’ve concluded that keeping the GPU constantly busy is indeed an effective method, which aligns with my previous approach.

We use a simple kernel to keep the GPU busy so that we can get better and steadier performance.

However, in that case there are considerable waiting periods between GPU computations, causing the GPU to be reactivated each time and significantly increasing the overall latency.
My case is different: the waiting time between GPU computations is relatively short, within 50 ms, meaning the GPU remains active throughout.
My speculation is that the GPU decides the current tasks do not require such high clock speeds and automatically underclocks to reduce power consumption.
Hence, I need to launch a kernel every 10 microseconds (= 10,000 ns = 0.01 ms) to keep the GPU at a high clock frequency, but this does affect the performance of other computations.

Keeping the GPU busy at a 10-microsecond interval might help, but it adds to the execution time of other CUDA code. Are there alternative methods to lock the frequency directly without resorting to these workarounds? For instance, could it be done by modifying configuration files or using a standalone clock-locking tool that doesn’t depend on the CUDA version?
Alternatively, is there a lighter-weight GPU workload (not a 10-microsecond loop kernel) that can dynamically boost the GPU clock in real time while minimizing the impact on other CUDA computations?
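One idea I have been considering (my own untested thought, not something suggested in this thread) is to launch the keep-alive kernel on the lowest-priority CUDA stream, hoping the scheduler prefers the real inference work over it. A rough sketch:

```
// Untested idea: run the keep-alive kernel on the lowest-priority stream so
// that blocks from the inference streams (default / higher priority) are
// preferred by the scheduler. Whether this actually reduces the interference
// is exactly what I would need to measure.
#include <chrono>
#include <cuda_runtime.h>

__global__ void keep_alive() {}

int main()
{
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t lowPrio;  // numerically larger value = lower priority
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, leastPriority);

    const auto interval = std::chrono::microseconds(10);  // same threshold as before
    for (;;) {
        keep_alive<<<1, 1, 0, lowPrio>>>();
        const auto next = std::chrono::steady_clock::now() + interval;
        while (std::chrono::steady_clock::now() < next) { /* spin */ }
    }
}
```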

NVIDIA keeps the details of its dynamic clock control mechanisms proprietary. They can and do change without notice. There are no user-controllable knobs for this. Application clocks and clock locking are restricted features that are part of market segmentation.

I have not come across any information indicating that someone has reverse engineered and bypassed NVIDIA clock control mechanisms. Given that (based on historical observation) the details seem to change frequently, any such hack, if it existed, would likely be very brittle.

“Correct” solutions:

(1) Use hardware that supports clock locking
(2) When crafting GPU-accelerated applications, bundle the GPU work so it occurs in tight temporal neighborhoods, minimizing intermittent idling. The point of using GPU acceleration is typically to keep the GPU as busy as possible; much GPU idling would suggest to me poor use of acceleration.

You can try MSI Afterburner. No guarantees of course. Since your card in question is a GeForce card on windows, there are numerous postings of people asking to lock the clocks for graphics purposes. A bit of google searching will turn up many.

Thank you for your reply,

In the older product lines, the GTX 1080 has not yet been fully phased out, and it’s not just the 1080: a number of GPUs, e.g. the 10xx and 20xx series, are unable to run programs stably under the latest drivers. (They can run stably with 4xx drivers, for example 472, using CUDA 11.6 + TensorRT 7, but not CUDA 12 + TensorRT 8.) We cannot replace all graphics cards with newer GPUs that support clock locking; in certain scenarios, using lower-end GPUs offers a better cost-performance ratio.

Thank you for your advice, but my tests have shown that the GPU maintains its high clock frequency only when CUDA computations are executed continuously within a 50-microsecond interval.

The GPU’s automatic downclocking occurs quite rapidly. In my model inference, computations occur at 100 Hz (every 10 milliseconds), which should be a relatively dense and reasonable computation frequency, but 100 Hz is not enough.

I even tried using a loop like this: while(1){ cudaDeviceSynchronize(); run_kernel<<<1,1>>>(); } with the intention of keeping the GPU busy by immediately launching another kernel as soon as all other CUDA work had finished. However, I found that this approach was less effective than running the kernel every 10 microseconds.
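For completeness, the full loop I tried looks roughly like this (run_kernel is just an empty kernel):

```
// The back-to-back variant mentioned above (simplified): wait for all work
// previously issued by this process, then immediately queue one more trivial
// kernel. In my tests this was less effective than launching the kernel on a
// fixed 10-microsecond interval.
#include <cuda_runtime.h>

__global__ void run_kernel() {}  // empty kernel, only there to keep the GPU active

int main()
{
    while (true) {
        cudaDeviceSynchronize();   // block until all preceding work in this process completes
        run_kernel<<<1, 1>>>();    // then immediately issue another tiny kernel
    }
}
```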

Thank you for your suggestions.
In fact, we’ve tried a variety of third-party overclocking tools, including but not limited to MSI Afterburner.

MSI Afterburner also does not support old GPUs; it fails to lock the 1080Ti clock.

In addition, the unofficial clock locking is highly unstable and tends to significantly impact the system (often causing Windows crashes).

Moreover, it fails to consistently maintain the locked frequency, occasionally resulting in downclocking. By contrast, nvidia-smi proves very effective at reliably locking the clock frequency on GPUs that support it.

Thank you to everyone who has responded over the past few days. I have now gained a general understanding and summarized several key points.

  1. Newer graphics cards are the ones that support clock control (or, to be more precise, newer drivers provide clock-locking features for newer GPUs, but not for old GPUs).

  2. To keep the GPU clocks high, it is necessary to ensure the GPU remains under a sufficient workload (for example, GPU work every 10 µs).

  3. On older GPUs, there are currently no reliable solutions for stable clock locking.

I initially wrote an answer (now deleted) based on misreading 50 microseconds as 50 milliseconds. If the 50 microseconds are accurate, this seems unexpectedly rapid control. Not sure what is going on with that.

Yes, for better stability it requires a kernel to be launched at least once every 50 microseconds (µs). I’ve set it to run every 10 microseconds (10 µs = 10,000 ns = 0.01 ms), which seems to provide more reliable results, since exceeding 50 microseconds significantly increases the downclocking.

I wrote a simple demo with no additional CUDA computations included, and it appeared to require an even higher launch frequency: running at 10 µs intervals didn’t guarantee a stable high clock; instead, a 1-microsecond interval achieved this consistently.

When there are other CUDA programs running concurrently or models performing inference, this threshold can be set to a larger value (10 to 50 µs) and still have a positive effect.