Maybe I chose the wrong forum, but I am speaking in terms of an NVIDIA card. I have a Tesla T4 card and there must be some tool to limit its maximum load. Maybe via the maximum power? But that seems like a hacky way to me. I was sure that there was a more elegant way. If there is no way to control the load at all, then the Whisper code writers are even less able to control that load…
I don’t follow your logic. A GPU has a particular capability. Presumably whisper has a particular workload that uses a percentage of that capability. If I limit power usage, or otherwise reduce the GPU capability, and the workload remains the same, the percentage utilization of the GPU will go up, not down. To make the GPU percentage utilization go down, I would need to be able to add more capability to the GPU, not reduce its capability.
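To put illustrative numbers on it: suppose the workload needs, say, 5 TFLOP/s of a GPU that can deliver 8 TFLOP/s; utilization is then about 62%. Throttle that same GPU down to 6 TFLOP/s and the unchanged workload now occupies about 83%.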
The only suggestion I have is to reduce the workload, or find a GPU that has more capability. I’m unlikely to be able to offer further suggestions or to respond again. It’s OK if we disagree; it’s the nature of community.
I am confused. Real-time speech-to-text conversion is a problem that was successfully tackled by single-core CPUs using early SIMD instruction sets 20 years ago. How is this soaking up 90% of a (fairly) modern GPU?
Does whisper convert multiple streams simultaneously? Does it convert faster than real time for offline conversions? If either of those two apply, try reducing the number of simultaneous streams or the conversion speed. If there are quality settings, try reducing those.
BTW, can whisper reliably distinguish between “how to recognize speech” and “how to wreck a nice beach”?
If the quoted numbers are for a single stream of speech-to-text conversion, that application seems just insane.
If the app has no knobs to turn down its resource usage, your best bet is to use the latest and most powerful GPU hardware (the T4 is a slightly older, middle-of-the-road kind of GPU; try something high-end based on the Ampere architecture), with the caveat that the app may be designed to automatically expand its resource usage if more resources are available.
Alternatively, use a different app to transcribe speech to text.
“Very low” does not mean much. The somewhat humorous test case I quoted is an often cited one that has been around for at least thirty years (which is when I came across it), and that caused problems for the speech-to-text conversion available at the time.
It looks like you may be able to reduce the processing load by specifying something like --model small?
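For instance, a minimal sketch assuming the open-source openai-whisper Python package (the audio file name is just a placeholder):

# Sketch: load a smaller Whisper checkpoint to cut the GPU workload.
# "small" trades some accuracy for considerably less compute than "large".
import whisper

model = whisper.load_model("small")       # instead of "large"
result = model.transcribe("audio.wav")    # placeholder input file
print(result["text"])

The equivalent knob on the command line would be whisper audio.wav --model small.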
I think the ASR SOTA has changed over time. The stuff that was referred to as a “solved problem” 20 years ago I think had one or both of these characteristics:
- limited vocabulary
- not speaker independent
Modern ASR (such as what is in riva) is both speaker independent and large vocabulary, and can even be multilingual. Even though I can describe those things in single bullet points, they substantially increase the complexity of the algorithm, or stated differently, the computational load, to achieve a particular accuracy level.
It certainly makes a difference what the exact need is. Transcribing scripted TV shows with professional actors, versus transcribing an episode of the Jerry Springer Show, versus transcribing the Nixon tapes.
I’ll grant you that the products of twenty years ago would not have been able to transcribe most of the Nixon tapes accurately (even humans have to make multiple passes over many of them to accomplish that). They were also not multilingual in any one configuration, and their monolingual capabilities were limited to a handful of major European and Asian languages.
OP has not described their use case. Maybe they need the ability to transcribe poor quality recordings and require simultaneous support for multiple languages because they need to be able to handle the South African national anthem in one go. In which case use of the large model (whatever that is) may be justified and indicated, and they should just get the biggest and baddest GPU available. Or maybe they just need to transcribe the 10 o’clock news in real time, in which case a less ambitious configuration of the whisper app might suffice.
We are not discussing the Whisper software here; I mentioned it just to clarify what exactly I am doing. I want to understand whether there is any way to get the SM/MEM load down:
- via some kind of API or NVIDIA setting
- or whether it is a task for the Whisper coders (though I suppose they would still have to use some NVIDIA tool to set it)
P.S. I attempted to set the max power, per the nvidia-smi help:
-pl, --power-limit=   Specifies maximum power management limit in watts.
root@0f0526c3863d:/data# nvidia-smi --power-limit=50
Provided power limit 50.00 W is not a valid power limit which should be between 60.00 W and 70.00 W for GPU 00000000:00:05.0
Terminating early due to previous errors.
root@0f0526c3863d:/data# nvidia-smi --power-limit=60
Failed to set power management limit for GPU 00000000:00:05.0: Insufficient Permissions
Terminating early due to previous errors.
I don’t get why it says “Insufficient Permissions”.
root@0f0526c3863d:/data# nvidia-smi -q -d POWER
==============NVSMI LOG==============

Timestamp                        : Thu Jul 13 08:57:48 2023
Driver Version                   : 525.105.17
CUDA Version                     : 12.1

Attached GPUs                    : 1
GPU 00000000:00:05.0
    Power Readings
        Power Management         : Supported
        Power Draw               : 27.28 W
        Power Limit              : 70.00 W
        Default Power Limit      : 70.00 W
        Enforced Power Limit     : 70.00 W
        Min Power Limit          : 60.00 W
        Max Power Limit          : 70.00 W
    Power Samples
        Duration                 : 29096.54 sec
        Number of Samples        : 119
        Max                      : 30.56 W
        Min                      : 16.79 W
        Avg                      : 22.45 W
@Robert_Crovella already pointed out at the start of the thread what needs to be done in order to reduce the average GPU load from 90–100% to 50–70%: either (1) use a faster GPU, or (2) reduce the amount of work submitted to the GPU by configuring the app accordingly.
Something that can help with item (2) is lowering the priority of the host thread that submits the work to the GPU. Note that this is neither guaranteed to achieve any particular GPU load factor, nor is it typically suitable for online work, e.g. real-time closed captioning in the case of speech-to-text. This method can also drastically reduce achieved application-level throughput, depending on what else is going on in the machine. It is therefore most suitable for long-running background work without a fixed deadline (e.g. BOINC).
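As a rough illustration, a sketch assuming the GPU work is fed from a Python process on Linux (the transcribe call is a stand-in for whatever actually submits the work):

# Sketch: lower the scheduling priority of the process feeding the GPU.
# A higher nice value makes the Linux scheduler run this process less
# eagerly, which indirectly spaces out the work submitted to the GPU.
import os

os.nice(19)   # raise niceness by 19 (lowest priority); no privileges needed
# ... submit GPU work from here on, e.g. model.transcribe("audio.wav")

The same effect is available from the shell by prefixing the command with nice -n 19.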
nvidia-smi requires administrative privileges for various hardware settings that can impact other users or applications. Try setting the power limit when in possession of administrative privileges, e.g. via sudo. Note that setting a lower power limit is not going to reduce the GPU load factor. A lower power limit can lead to reduced GPU performance and thus a higher load factor. It can obviously reduce the application-level throughput (i.e. seconds of speech processed per wall clock time elapsed). The relationship is not linear and you may well achieve higher energy efficiency (seconds of speech processed per Watt-hour).
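If you would rather query these numbers programmatically than parse nvidia-smi output, the same information is exposed through NVML; a minimal sketch assuming the pynvml Python bindings (reading the power draw needs no privileges, setting a limit does):

# Sketch using the pynvml NVML bindings to query power draw and limits.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)               # first GPU

draw_w  = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
limit_w = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
lo, hi  = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)

print(f"draw {draw_w:.2f} W, limit {limit_w:.2f} W "
      f"(allowed {lo/1000.0:.2f}-{hi/1000.0:.2f} W)")

pynvml.nvmlShutdown()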
New hardware is not an option; we already have NVIDIA hardware and its power is more than enough for us. From this I conclude that if NVIDIA has no such capability, then the Whisper coders probably have to make the load tunable.
I will look into this option, but it is also a hacky way. I thought that some NVIDIA person would come to the forum and say “we don’t have such a tool to tune the load, we will add it”. Oh, dreams…
You can certainly file a feature request (e.g. for kernel launch throttling at a specific rate) with NVIDIA. Use the bug reporting form; just make sure to indicate that it’s a feature request. Much in the development of the CUDA ecosystem was and is actually driven by customer requests. However, adding functionality of any kind comes with NRE and maintenance costs.
The more customers request a particular feature, and the more additional revenue is likely going to result from implementation of that feature, the higher the likelihood of implementation.
It was necessary to set the limit not from the container but from the host machine:
[root@server-temp ]# nvidia-smi --persistence-mode=1
Enabled persistence mode for GPU 00000000:00:05.0.
All done.
[root@server-temp ]# nvidia-smi --power-limit=60
Power limit for GPU 00000000:00:05.0 was set to 60.00 W from 70.00 W.
All done.