We have a multithreaded application that uses the hardware decoders on GeForce and Quadro cards: it decodes H.264, performs the NV12-to-RGBA conversion in CUDA, and maps the results into OpenGL Pixel Buffer Objects (PBOs), which are then turned into textures and rendered. We run one decompression thread per stream on 4K H.264 input. (The frames are then stitched and rendered, but that is not relevant to the question. We use fences and mutexes to make sure everything ends up in the right place: one thread decompresses into sets of output buffers, and another thread plays back from the textures in those buffers.)
Our code is a heavily modified version of the NVIDIA decode example code. We use CMake to build under both Windows 10 and Linux, with GLFW as a cross-platform windowing library (we can also use GLX on Linux, with the same performance).
When we run the code on a GeForce GTX 1070 in an Alienware laptop running Windows 10 (current driver version 417.35, though the result has not changed over many driver versions), we get 38 frames/second of decode across five 4K streams (3840x2160). When we run on another GTX 1070 Alienware laptop under Ubuntu 16.04, or on a Quadro P4000 in a rack-mounted server (current driver version 410, again stable across versions), we can barely reach 30 fps. If we run multiple processes, we get 30 fps; if we run multiple threads on the different GPUs, we fall just short at 29 fps.
My question is whether this sort of performance difference is expected due to more heavily optimized drivers on Windows, or whether we should keep digging into our code to figure out what is going on.