Maybe I chose the wrong forum, but I am speaking in terms of an NVIDIA card. I have a Tesla T4 card and there must be some tool to limit its maximum load. Maybe via the maximum power? But that seems like a hacky way to me. I was sure that there was a more elegant way. If there is no way to control the load at all, then the Whisper code writers are even less able to control that load…
I don’t follow your logic. A GPU has a particular capability. Presumably whisper has a particular workload that uses a percentage of that capability. If I limit power usage, or otherwise reduce the GPU capability, and the workload remains the same, the percentage utilization of the GPU will go up, not down. To make the GPU percentage utilization go down, I would need to be able to add more capability to the GPU, not reduce its capability.
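To put illustrative numbers on it: suppose the workload needs, say, 5 TFLOP/s of a GPU that can deliver 8 TFLOP/s; utilization is then about 62%. Throttle that same GPU down to 6 TFLOP/s and the unchanged workload now occupies about 83%.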
The only suggestion I have is to reduce the workload, or find a GPU that has more capability. I’m unlikely to be able to offer further suggestions or to respond again. It’s OK if we disagree; it’s the nature of community.
I am confused. Real-time speech-to-text conversion is a problem that was successfully tackled by single-core CPUs using early SIMD instruction sets 20 years ago. How is this soaking up 90% of a (fairly) modern GPU?
Does whisper convert multiple streams simultaneously? Does it convert faster than real time for offline conversions? If either of those two apply, try reducing the number of simultaneous streams or the conversion speed. If there are quality settings, try reducing those.
BTW, can whisper reliably distinguish between “how to recognize speech” and “how to wreck a nice beach”?
If the quoted numbers are for a single stream of speech-to-text conversion, that application seems just insane.
If the app has no knobs to turn down its resource usage, your best bet is to use the latest and most powerful GPU hardware (the T4 is a slightly older, middle-of-the-road kind of GPU; try something high-end based on the Ampere architecture), with the caveat that the app may be designed to automatically expand its resource usage if more resources are available.
Alternatively, use a different app to transcribe speech to text.
“Very low” does not mean much. The somewhat humorous test case I quoted is an often cited one that has been around for at least thirty years (which is when I came across it), and that caused problems for the speech-to-text conversion available at the time.
It looks like you may be able to reduce the processing load by specifying something like --model small?
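For instance, a minimal sketch assuming the open-source openai-whisper Python package (the audio file name is just a placeholder):

# Sketch: load a smaller Whisper checkpoint to cut the GPU workload.
# "small" trades some accuracy for considerably less compute than "large".
import whisper

model = whisper.load_model("small")       # instead of "large"
result = model.transcribe("audio.wav")    # placeholder input file
print(result["text"])

The equivalent knob on the command line would be whisper audio.wav --model small.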
I think the ASR SOTA has changed over time. The stuff that was referred to as a “solved problem” 20 years ago I think had one or both of these characteristics:
- limited vocabulary
- not speaker independent
Modern ASR (such as what is in riva) is both speaker independent and large vocabulary, and can even be multilingual. Even though I can describe those things in single bullet points, they substantially increase the complexity of the algorithm, or stated differently, the computational load, to achieve a particular accuracy level.
It certainly makes a difference what the exact need is. Transcribing scripted TV shows with professional actors, versus transcribing an episode of the Jerry Springer Show, versus transcribing the Nixon tapes.
I’ll grant you that the products of twenty years ago would not have been able to transcribe most of the Nixon tapes accurately (even humans have to make multiple passes over many of them to accomplish that). They were also not multilingual in any one configuration, and their monolingual capabilities were limited to a handful of major European and Asian languages.
OP has not described their use case. Maybe they need the ability to transcribe poor quality recordings and require simultaneous support for multiple languages because they need to be able to handle the South African national anthem in one go. In which case use of the large model (whatever that is) may be justified and indicated, and they should just get the biggest and baddest GPU available. Or maybe they just need to transcribe the 10 o’clock news in real time, in which case a less ambitious configuration of the whisper app might suffice.
We are not discussing the Whisper software here; I mentioned it just to clarify what exactly I am doing. I want to understand whether there is any way to get the SM/MEM load down:
- via some kind of API or NVIDIA setting
- or whether it is a task for the Whisper coders (though I suppose they would still have to use some NVIDIA tool to set it)
P.S. I attempted to set the max power, per the nvidia-smi help:
-pl, --power-limit=   Specifies maximum power management limit in watts.
root@0f0526c3863d:/data# nvidia-smi --power-limit=50
Provided power limit 50.00 W is not a valid power limit which should be between 60.00 W and 70.00 W for GPU 00000000:00:05.0
Terminating early due to previous errors.
root@0f0526c3863d:/data# nvidia-smi --power-limit=60
Failed to set power management limit for GPU 00000000:00:05.0: Insufficient Permissions
Terminating early due to previous errors.
I don’t get why it says “Insufficient Permissions”.
root@0f0526c3863d:/data# nvidia-smi -q -d POWER
==============NVSMI LOG==============

Timestamp                        : Thu Jul 13 08:57:48 2023
Driver Version                   : 525.105.17
CUDA Version                     : 12.1

Attached GPUs                    : 1
GPU 00000000:00:05.0
    Power Readings
        Power Management         : Supported
        Power Draw               : 27.28 W
        Power Limit              : 70.00 W
        Default Power Limit      : 70.00 W
        Enforced Power Limit     : 70.00 W
        Min Power Limit          : 60.00 W
        Max Power Limit          : 70.00 W
    Power Samples
        Duration                 : 29096.54 sec
        Number of Samples        : 119
        Max                      : 30.56 W
        Min                      : 16.79 W
        Avg                      : 22.45 W
@Robert_Crovella already pointed out at the start of the thread what needs to be done in order to reduce the average GPU load from 90–100% to 50–70%: either (1) use a faster GPU, or (2) reduce the amount of work submitted to the GPU by configuring the app accordingly.
Something that can help with item (2) is lowering the priority of the host thread that submits the work to the GPU. Note that this is neither guaranteed to achieve any particular GPU load factor, nor is it typically suitable for online work, e.g. real-time closed captioning in the case of speech-to-text. This method can also drastically reduce achieved application-level throughput, depending on what else is going on in the machine. It is therefore most suitable for long-running background work without a fixed deadline (e.g. BOINC).
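As a rough illustration, a sketch assuming the GPU work is fed from a Python process on Linux (the transcribe call is a stand-in for whatever actually submits the work):

# Sketch: lower the scheduling priority of the process feeding the GPU.
# A higher nice value makes the Linux scheduler run this process less
# eagerly, which indirectly spaces out the work submitted to the GPU.
import os

os.nice(19)   # raise niceness by 19 (lowest priority); no privileges needed
# ... submit GPU work from here on, e.g. model.transcribe("audio.wav")

The same effect is available from the shell by prefixing the command with nice -n 19.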
nvidia-smi requires administrative privileges for various hardware settings that can impact other users or applications. Try setting the power limit when in possession of administrative privileges, e.g. via sudo. Note that setting a lower power limit is not going to reduce the GPU load factor. A lower power limit can lead to reduced GPU performance and thus a higher load factor. It can obviously reduce the application-level throughput (i.e. seconds of speech processed per wall clock time elapsed). The relationship is not linear and you may well achieve higher energy efficiency (seconds of speech processed per Watt-hour).
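If you would rather query these numbers programmatically than parse nvidia-smi output, the same information is exposed through NVML; a minimal sketch assuming the pynvml Python bindings (reading the power draw needs no privileges, setting a limit does):

# Sketch using the pynvml NVML bindings to query power draw and limits.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)               # first GPU

draw_w  = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
limit_w = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
lo, hi  = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)

print(f"draw {draw_w:.2f} W, limit {limit_w:.2f} W "
      f"(allowed {lo/1000.0:.2f}-{hi/1000.0:.2f} W)")

pynvml.nvmlShutdown()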
New hardware is not an option; we already have NVIDIA hardware and its power is more than enough for us. From this I conclude that if NVIDIA has no such capability, then the Whisper coders probably have to make the load tunable.
I will look into this option, but it is also a hacky way. I thought that some NVIDIA person would come to the forum and say “we don’t have such a tool to tune the load, we will add it”. Oh, dreams…
You can certainly file a feature request (e.g. for kernel launch throttling at a specific rate) with NVIDIA. Use the bug reporting form; just make sure to indicate that it’s a feature request. Much in the development of the CUDA ecosystem was and is actually driven by customer requests. However, adding functionality of any kind comes with NRE and maintenance costs.
The more customers request a particular feature, and the more additional revenue is likely going to result from implementation of that feature, the higher the likelihood of implementation.
It was necessary to set the limit not from the container but from the host machine:
[root@server-temp ]# nvidia-smi --persistence-mode=1
Enabled persistence mode for GPU 00000000:00:05.0.
All done.
[root@server-temp ]# nvidia-smi --power-limit=60
Power limit for GPU 00000000:00:05.0 was set to 60.00 W from 70.00 W.
All done.