NVDEC/NVENC VRAM allocation differences between different GPUs

I am seeing an unexpected difference in VRAM allocation when running the same ffmpeg NVDEC decode on different GPUs.
A simple use of h264_cuvid such as the following:
ffmpeg -c:v h264_cuvid -surfaces 8 -f mpegts -i https://samples.ffmpeg.org/V-codecs/h264/HD-h264.ts -vcodec libx264 -preset veryfast -crf 23 -c:a copy -f mpegts transcoded.ts
allocates 153MB of VRAM on a GTX 1070 Ti under Linux with driver 390.77 (also tested with 384.110), but only 87MB on a GTX 1050 Ti under Linux with the same drivers.
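
For anyone who wants to reproduce per-process figures like these, here is a minimal Python sketch that reads them from nvidia-smi. It assumes nvidia-smi is on the PATH and that the decoding process shows up in the compute-apps list, which was not verified in this thread. The helper names parse_compute_apps and query_vram_by_pid are hypothetical. nvidia-smi reports memory in MiB.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' rows (no header, no units) into {pid: MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 2:
            continue  # skip rows that do not have exactly two columns
        pid, mem = fields
        try:
            usage[int(pid)] = int(mem)
        except ValueError:
            continue  # skip non-numeric values such as "[N/A]"
    return usage


def query_vram_by_pid():
    """Ask nvidia-smi for the VRAM used by each compute process."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

Calling query_vram_by_pid() while ffmpeg is running should return a dictionary keyed by process ID, so the ffmpeg entry can be compared across machines.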

Interestingly, the GTX 1070 Ti under Windows allocates 132MB, which is less than under Linux but more than the GTX 1050 Ti. Unfortunately I couldn’t test the 1050 Ti on Windows. The Windows driver is 391 something (the one the latest Windows 10 installs by default).

I have also tested memory allocation with the AppDec sample from the Video Codec SDK, and it shows similar results:
205MB allocated on the GTX 1070 Ti (higher because AppDec uses 20 surfaces) and 139MB on the GTX 1050 Ti.
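
The four figures quoted so far fit a simple model: a fixed cost per process plus a cost per decode surface. The sketch below fits a line through the 8-surface (ffmpeg) and 20-surface (AppDec) measurements for each GPU. It assumes both programs carry the same fixed overhead and the same per-surface cost, which this thread does not confirm, so treat the result as a rough estimate.

```python
def fit_linear(p1, p2):
    """Fit memory = fixed + per_surface * surfaces through two points."""
    (n1, m1), (n2, m2) = p1, p2
    per_surface = (m2 - m1) / (n2 - n1)
    fixed = m1 - per_surface * n1
    return fixed, per_surface


# (surfaces, MB allocated) as reported in this thread, Linux only
measurements = {
    "GTX 1070 Ti": [(8, 153), (20, 205)],
    "GTX 1050 Ti": [(8, 87), (20, 139)],
}

fits = {gpu: fit_linear(*pts) for gpu, pts in measurements.items()}
for gpu, (fixed, per_surface) in fits.items():
    print(f"{gpu}: fixed ~{fixed:.1f} MB, per surface ~{per_surface:.2f} MB")
```

Under this model both GPUs give the same slope (52MB across 12 extra surfaces), and the whole 66MB gap between the cards sits in the fixed term.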

Similar VRAM allocation differences occur when encoding too.

Is this a bug in the NVDEC/NVENC libraries or the SDK? Is it a driver bug? Or is this considered normal?
VRAM allocation for the GTX 1070 Ti on Linux is quite high compared to both Windows and the GTX 1050 Ti, which limits the number of concurrent decoding sessions (encoding sessions are limited anyway on non-Quadro cards).

Hi malakudi,

The VRAM allocation difference between the GTX 1070 Ti and the GTX 1050 Ti for encoding and decoding use cases is expected behavior and not a bug. The difference comes from memory allocations made by the CUDA driver.
CUDA conservatively allocates execution resources based on the capability of the chip. Since the GTX 1070 Ti has more SMs available, its resource allocation is expected to be proportionally larger than that of the GTX 1050 Ti.
For running multiple concurrent decoding sessions, we recommend creating a single CUDA context and sharing it among the decode sessions. This reduces overall VRAM consumption and supports a higher number of concurrent sessions.

Ryan Park
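
To give a rough sense of what sharing one CUDA context could save, here is a back-of-the-envelope Python sketch. The constants are assumptions, not measurements of the context itself: they come from treating the 8-surface (153MB) and 20-surface (205MB) figures for the GTX 1070 Ti on Linux as two points on a line, and from assuming the entire fixed part is per-context overhead that is paid only once when the context is shared. The function names are hypothetical.

```python
FIXED_MB = 118.3       # assumed per-context overhead, GTX 1070 Ti on Linux
PER_SURFACE_MB = 4.33  # assumed cost of one decode surface
SURFACES = 8           # surfaces per session, as in the ffmpeg command


def vram_separate_contexts(sessions):
    """Every session (e.g. every ffmpeg process) creates its own context."""
    return sessions * (FIXED_MB + SURFACES * PER_SURFACE_MB)


def vram_shared_context(sessions):
    """All sessions share one context; only the surfaces scale."""
    return FIXED_MB + sessions * SURFACES * PER_SURFACE_MB


for n in (1, 4, 16):
    print(f"{n:2d} sessions: separate ~{vram_separate_contexts(n):.0f} MB, "
          f"shared ~{vram_shared_context(n):.0f} MB")
```

With one session the two estimates coincide, and the gap grows by roughly one fixed overhead per additional session. Real savings depend on what else the driver allocates per session, which this model ignores.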

OK, so the difference is a default minimum memory allocation for a CUDA context. This of course doesn’t explain why the Windows driver allocates less than the Linux driver.
Anyway, the problem is that in my use case I run multiple instances of ffmpeg to do the decoding. So each instance pays a default allocation that is not really needed, since decoding alone does not use the SMs, only NVDEC. Maybe you could find a way to fix this.