NVENC performance degradation from CUDA 11.4.2 to CUDA 11.6.2

We have noticed that when encoding with FFmpeg and NVENC in real time, CPU utilization more than triples going from CUDA 11.4.2 to CUDA 11.6.2. At the same time, CUDA 11.4.2 achieves slightly more frames per second.

We also observed that FFmpeg uses 15 threads on CUDA 11.4.2 and 102 on CUDA 11.6.2, which may be related to this issue.

We used a Python script to start and monitor the FFmpeg process, which we can provide on request.
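For readers who want to reproduce the measurement, here is a minimal sketch of what such a monitoring script could look like. It is not our actual script. It assumes the third-party psutil package is installed, that ffmpeg is on the PATH, and that CPU load is normalized to the whole machine (divided by the logical core count). The function names build_command and monitor are made up for illustration.

```python
import subprocess


def build_command(input_path, output_path):
    """Build the FFmpeg command from this post as an argument list."""
    return [
        "ffmpeg", "-y", "-re", "-stream_loop", "2",
        "-i", input_path,
        "-c:v", "h264_nvenc", "-b:v", "10M",
        output_path,
    ]


def monitor(cmd, interval=1.0):
    """Run cmd and sample its CPU load and thread count until it exits.

    Returns a list of (cpu_load_percent, thread_count) tuples.
    Assumption: load is normalized by the number of logical CPUs.
    """
    import psutil  # third-party; imported here so build_command works without it

    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    ps = psutil.Process(proc.pid)
    n_cpus = psutil.cpu_count() or 1
    samples = []
    while proc.poll() is None:
        try:
            # cpu_percent blocks for `interval` seconds and can exceed 100
            # on multi-core machines, hence the normalization.
            load = ps.cpu_percent(interval=interval) / n_cpus
            threads = ps.num_threads()
        except psutil.NoSuchProcess:
            break  # the process ended between poll() and the sample
        samples.append((load, threads))
    return samples


if __name__ == "__main__":
    cmd = build_command("touchdown_pass_1080p.y4m", "touchdown_pass_1080p.mp4")
    for load, threads in monitor(cmd):
        print(f"cpu={load:.2f}% threads={threads}")
```

Recording the thread count with each sample makes it easy to check whether the higher CPU load coincides with the larger number of threads.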

We used the official FFmpeg 5.0 binary with nvenc and h264 from GyanD (link removed)

Video URL: https://media.xiph.org/video/derf/y4m/touchdown_pass_1080p.y4m

CUDA     CPU load mean [%]   CPU load std [%]   CPU load min [%]   CPU load max [%]
11.6.2   2.02                0.47               1.32               2.69
11.4.2   0.66                0.15               0.46               0.93
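As a rough illustration of how the columns above can be derived from raw samples, here is a small sketch using only the Python standard library. The summarize helper is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption. The last lines check the "more than triples" claim against the two mean values from the table.

```python
import statistics


def summarize(samples):
    """Return mean, standard deviation, min and max of CPU load samples.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    return {
        "mean": statistics.mean(samples),
        "std": statistics.stdev(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Ratio of the mean CPU loads reported in the table (11.6.2 vs. 11.4.2).
ratio = 2.02 / 0.66
print(f"CPU load ratio: {ratio:.2f}")  # prints "CPU load ratio: 3.06"
```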

FFmpeg command: ffmpeg -y -re -stream_loop 2 -i touchdown_pass_1080p.y4m -c:v h264_nvenc -b:v 10M touchdown_pass_1080p.mp4

Used system:

CPU: AMD Epyc 7352 (24 cores, 48 threads)

GPU: Nvidia Quadro RTX 4000

Memory: 64 GB @ 3200 MHz (enough to cache all videos)

Mainboard: Gigabyte MZ32-AR0

OS: Windows 10 Pro 21H2 64-bit

CUDA Toolkit versions:

cuda_11.4.2_471.41_win10

cuda_11.6.2_511.65_windows

Does anyone know whether this is a regression/bug or how to work around this?

Please retest with a normal binary, for example from BtbN/FFmpeg-Builds on GitHub.

The CUDA toolkit does not play any role in this; it is not used for compilation, clang is used instead.