NVCUVID performance while decoding multiple videos at the same time

Hi. In my current project I need to decode and process a very large number of videos. I followed the NVCUVID sample provided with the CUDA Toolkit and wrote an application to see whether using CUDA for this specific problem would even make sense. The test machine has 2x NVIDIA GRID K2 GPUs and a 40-core CPU.

In the test application I created 40 threads, assigning 10 threads to each GPU. Then, using the combination of video source, video parser and video decoder, I decoded 1000 short videos (average duration 6 seconds, average size 480x480) and then 100 long videos (average duration 5952 seconds, average size 1670x808). As it turned out, the test application using NVCUVID performed on average 7x worse than a similar application written with FFmpeg for the short videos and 1.6x worse for the long videos.
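The thread layout described above can be sketched roughly as follows. This is a simplified illustration, not the actual test code: `decode_video` is a hypothetical placeholder for the source/parser/decoder pipeline, and the shared work index that hands videos to threads is an assumption. With 40 threads and 10 threads per GPU, the mapping yields GPU indices 0 to 3.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical placeholder for the per-video work (video source + parser + decoder).
// A real version would make the context of `gpu` current and run the NVCUVID pipeline.
void decode_video(int gpu, const std::string& path) {
    (void)gpu;
    (void)path;
}

// Map a worker thread to a GPU index: threads 0-9 -> 0, 10-19 -> 1, and so on.
int gpu_for_thread(int thread_id, int threads_per_gpu) {
    assert(threads_per_gpu > 0);
    return thread_id / threads_per_gpu;
}

// Run num_threads workers that pull videos from a shared index (an assumption;
// the original post does not say how work is distributed).
// Returns how many videos were handled per GPU index.
std::vector<int> run_workers(const std::vector<std::string>& videos,
                             int num_threads, int threads_per_gpu) {
    assert(num_threads > 0 && threads_per_gpu > 0);
    const int num_gpus = (num_threads + threads_per_gpu - 1) / threads_per_gpu;

    std::vector<std::atomic<int>> per_gpu(num_gpus);
    for (auto& c : per_gpu) c.store(0);

    std::atomic<std::size_t> next{0};  // index of the next video to decode
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const int gpu = gpu_for_thread(t, threads_per_gpu);
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= videos.size()) break;  // no work left
                decode_video(gpu, videos[i]);
                per_gpu[gpu].fetch_add(1);
            }
        });
    }
    for (auto& w : workers) w.join();

    std::vector<int> counts;
    for (auto& c : per_gpu) counts.push_back(c.load());
    return counts;
}
```

Calling `run_workers(videos, 40, 10)` with a list of file paths reproduces the 40-thread, 10-threads-per-GPU layout; each video is claimed by exactly one thread.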

From my observations, the main bottleneck seems to be the initialization process and I/O in general. Is there any way to speed this up? Any pitfalls to avoid? Is this kind of workload even feasible on GPUs?