What's the situation with vdpau/vaapi/nvdec?

amrits,

Great to see you looking at it. This is the infamous CUDA-forces-P2 behaviour, and because NVDEC requires CUDA, it kicks in and can’t be prevented. VDPAU avoids it, and now Vulkan video decode avoids it too. It would certainly be great if there were a way to disable it with NVDEC.

…or just a way to use DMA-BUF with VDPAU. It would likely be a nicer solution than what we have now.

Technically, you could use Vulkan Video Decode to implement the vaapi driver, but it would be a horrific experience that would destroy your will to live. You’d basically have to use ffmpeg+libplacebo to do it to retain your sanity.

It also happens when you need the encoder (NVENC), as it also uses CUDA, e.g. when using the “GPU-Screen-Recorder”… This is especially bad because with higher-end NVIDIA GPUs the effect of staying in P2 is worse than with lower-end cards (I suspect because they are generally clocked higher and have higher TDPs). This leads to a 3080 having its fans running constantly to cool it, while a 3070 only needs them occasionally… when “idling” with a low-power CUDA task…

Still no fix for this?

Nvidia RTX 3080
Nvidia Drivers 550.67
Fedora 39, specifically Bazzite GNOME
Wayland + nvidia-vaapi-driver from elfarto

While watching a YouTube video using VP9 @ 1080p in Firefox:

Using nvidia-smi dmon -i 0:
From 37 W idle, memory clock 405 MHz, pclk 210 MHz
To 112 W, memory clock 9251 MHz, pclk 1905 MHz

That is, at least to my understanding, a lot of power for just playing a simple video. It happens with any other codec as well. I’m using this page as a reference: HTML5 audio/video tester - File type player - MIME type tester

I tested mpv with local files. Same behaviour with any codec, with 1080p or 4K resolution files.

Apologies for the long response time on this. I’ve known about this issue, and I recently looked into it under the previously mentioned bug 200504689.

We have just published the CUDA_DROP_TO_IDLE environment variable, which can be used to reduce power usage when using the encoders/decoders mentioned here. It is documented in the CUDA C++ Programming Guide. Please refer to that document for its requirements and how to use it. It is generally available in recent driver releases.

@elFarto please give it a try. I imagine it could be a default launch environment variable for the nvidia-vaapi-driver implementation.
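
For reference, here is a minimal sketch of what that could look like inside the driver, assuming the variable just needs to be present in the process environment before the first CUDA call. The init function shown here is hypothetical, not the driver’s actual entry point:

    #include <stdlib.h>
    #include <cuda.h>

    /* Hypothetical init hook: set CUDA_DROP_TO_IDLE as a default before the
     * first CUDA call, without overriding a value the user already exported. */
    static void set_default_power_hint(void)
    {
        setenv("CUDA_DROP_TO_IDLE", "1", 0);  /* overwrite=0 keeps the user's value */
    }

    int init_cuda_backend(void)
    {
        set_default_power_hint();
        return cuInit(0) == CUDA_SUCCESS ? 0 : -1;
    }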

@wpierce thanks for this update. However, my initial attempt to verify the behaviour in mpv doesn’t show any difference in power state or power consumption. The GPU stays at P1 and doesn’t drop back down.

In contrast, this clever trick that disables the boost ioctl does bring the power level down as desired.

I wonder what counts as a “GPU operation”, because the actual video decode pipeline requires CUDA memcpy operations to copy from CUDA memory into Vulkan images (there is no zero-copy export API in CUDA). If those count as “GPU operations”, then this isn’t actually going to change anything; we really need a way to prevent the boost from happening at all in that situation.
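
To make it concrete, the copies in question look roughly like this: one cuMemcpy2D per plane from the mapped decode output into device memory imported from the Vulkan side (via cuImportExternalMemory and cuExternalMemoryGetMappedBuffer). This is a simplified sketch with made-up parameter plumbing and no error handling, not the actual pipeline code:

    #include <cuda.h>

    /* Sketch: copy the luma plane of a decoded 8-bit NV12 frame (srcDev/srcPitch
     * as returned by cuvidMapVideoFrame) into device memory that was imported
     * from a Vulkan allocation. A second, similar copy is needed for chroma. */
    static CUresult copy_luma_plane(CUdeviceptr srcDev, size_t srcPitch,
                                    CUdeviceptr dstDev, size_t dstPitch,
                                    unsigned width, unsigned height,
                                    CUstream stream)
    {
        CUDA_MEMCPY2D cpy = {0};
        cpy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
        cpy.srcDevice     = srcDev;
        cpy.srcPitch      = srcPitch;
        cpy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
        cpy.dstDevice     = dstDev;
        cpy.dstPitch      = dstPitch;
        cpy.WidthInBytes  = width;   /* one byte per luma sample */
        cpy.Height        = height;
        return cuMemcpy2DAsync(&cpy, stream);
    }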


@wpierce Thanks for your reply. However, like @langdalepl, I can’t see any difference, likely due to the exact operations we need to perform to make the nvidia-vaapi-driver work. Or it’s just my card; it seems to be stuck at P0 permanently.

With that said, I feel I should update the original post I made 7 years ago. Since then, I released the initial version of the driver, which used EGLStreams to export the buffer from NVDEC/CUDA and get it into a dma-buf. That worked OK for a while, until EGLStreams was broken in your driver, and it hasn’t worked since.

Because of that, we’ve had to implement a new backend that directly pokes the nvidia driver to get the buffer exported. This is a fairly fragile method, although we’ve been lucky and it has only broken a few times, in fairly minor ways.

However, over the years of maintaining the driver it’s become clear that NVDEC just isn’t a good API to attempt to wrap with VA-API. NVDEC isn’t designed for this sort of playback inside a security-conscious environment. In addition to the power usage issue, there are others too:

  • An issue with NVDEC needing to know up front how many buffers/frames are needed, which is the opposite of how VA-API (and VDPAU) operate, where buffers can be created ad hoc. This can leave us needing to allocate as many as 32 buffers, which gets expensive at higher resolutions (see the sketch after this list).
  • Due to how NVDEC exposes the decoded frame, we need to do two memcpys to get the data out, which is really unnecessary.
  • NVDEC/CUDA was never designed to run inside a sandboxed environment. My attempts to modify Firefox’s sandbox to accommodate it would have degraded the sandbox to the point that it wouldn’t really be worth using, which is why we still disable it to this day. (I don’t really expect CUDA to work in this environment; as I say, it’s just not designed for this.)
  • NVDEC doesn’t work on Wayland (at least not without the hack our driver has to do to get the buffer out).
  • Support for Optimus setups. To be fair, I don’t think this is an easy problem to solve.
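
To make the first point concrete, this is roughly what decoder creation looks like on the NVDEC side: the decode surface pool has to be sized up front in CUVIDDECODECREATEINFO and can’t grow later, whereas a VA-API client just creates surfaces as it needs them. The values here are illustrative, not the driver’s actual configuration:

    #include <nvcuvid.h>

    /* Sketch: NVDEC wants the whole decode surface pool sized at creation time.
     * Worst-case reference requirements can push this towards 32 surfaces, each
     * holding a full uncompressed frame, which adds up at higher resolutions. */
    CUresult create_decoder(CUvideodecoder *dec, unsigned width, unsigned height)
    {
        CUVIDDECODECREATEINFO info = {0};
        info.ulWidth             = width;
        info.ulHeight            = height;
        info.ulMaxWidth          = width;
        info.ulMaxHeight         = height;
        info.ulTargetWidth       = width;
        info.ulTargetHeight      = height;
        info.CodecType           = cudaVideoCodec_VP9;
        info.ChromaFormat        = cudaVideoChromaFormat_420;
        info.OutputFormat        = cudaVideoSurfaceFormat_NV12;
        info.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;
        info.ulNumDecodeSurfaces = 32;  /* fixed here, up front */
        info.ulNumOutputSurfaces = 1;
        return cuvidCreateDecoder(dec, &info);
    }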

With that said, what I would like to see is VDPAU improved. I did see some mention in one of your roadmaps about making VDPAU work with Wayland. Ideally, I would love to see a function in VDPAU to export a VdpSurface as a DMA-BUF (along with supplying the modifiers/strides/etc…). This isn’t difficult code to write, as it’s basically what our driver is already doing. With this extra function we could use the original libva-vdpau-driver with some small changes, and would likely end up with a much nicer and more robust implementation.
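
For what it’s worth, the sort of entry point I have in mind would look something like the prototype below. To be clear, this is purely a hypothetical sketch to illustrate the request; the names, struct and parameters are made up and are not part of VDPAU today:

    #include <stdint.h>
    #include <vdpau/vdpau.h>

    /* Hypothetical: export a video surface as DMA-BUF(s), with the per-plane
     * stride/offset and the DRM format modifier an EGL/Vulkan importer needs. */
    typedef struct {
        int      fd;        /* DMA-BUF file descriptor for this plane */
        uint32_t offset;    /* byte offset of the plane within the buffer */
        uint32_t pitch;     /* stride in bytes */
        uint64_t modifier;  /* DRM format modifier */
    } VdpSurfaceDmaBufPlane;

    typedef VdpStatus VdpVideoSurfaceExportDmaBuf(
        VdpVideoSurface        surface,
        uint32_t               max_planes,
        uint32_t              *num_planes,  /* out */
        uint32_t              *drm_fourcc,  /* out: e.g. NV12 */
        VdpSurfaceDmaBufPlane *planes       /* out: caller-allocated array */
    );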

I think that’s everything. I appreciate that you haven’t completely forgotten about us.

Regards
elFarto


Thank you ever so much for sharing this. For the longest time I have tried to find a solution to the crazy high power usage and temperatures when using NVDEC with mpv (1080p video used to pin me at P1 and 80-85 °C with the default fan profile on a 3070). I could never understand how this could be right. I now kind of understand that it is due to NVDEC using CUDA on Linux.

Anyway, the clever trick you linked has sorted it all out: I’m no longer pinned at P1 when playing video, and temperatures are therefore far more reasonable for such activity.

Hopefully something akin to this can be built into the driver officially.

Thanks again

This is fixed in the latest R580 release and all future branches with the use of the CUDA_DISABLE_PERF_BOOST environment variable.

  • Added a new environment variable, CUDA_DISABLE_PERF_BOOST, to allow for disabling the default behavior of boosting the GPU to a higher power state when running CUDA applications. Setting this environment variable to ‘1’ will disable the boost.
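
For anyone who wants to verify this on their own machine, here is a small, hypothetical test harness (not part of any driver or documentation): it just initialises a CUDA context and idles for a minute, so the pwr/mclk/pclk columns from nvidia-smi dmon -i 0 can be compared between a plain run and a run with CUDA_DISABLE_PERF_BOOST=1 set in the environment:

    #include <stdio.h>
    #include <unistd.h>
    #include <cuda.h>

    /* Hold an otherwise idle CUDA context open for 60 seconds so GPU clocks
     * can be observed with and without CUDA_DISABLE_PERF_BOOST=1 set. */
    int main(void)
    {
        CUdevice dev;
        CUcontext ctx;

        if (cuInit(0) != CUDA_SUCCESS ||
            cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
            cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS) {
            fprintf(stderr, "failed to create CUDA context\n");
            return 1;
        }

        puts("context created; watch nvidia-smi dmon for 60 seconds");
        sleep(60);

        cuCtxDestroy(ctx);
        return 0;
    }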