Here is the setup:
A single CUDA context and two CPU threads: one schedules CUDA kernels, the other calls NVENC.
All operations including NVENC are on non-default streams. No dependency between CUDA surfaces and NVENC.
What we observe is that the GPU does not execute the CUDA operations until NVENC has finished all frames that were submitted before them.
CPU Thread 1: -----CCCC
CPU Thread 2: EEEE------
(C) - CUDA; (E) - NVENC
The “E” period takes ~1.5 ms and is 95% empty. All of the “C” work would easily fit inside the “E” period.
We are puzzled why everything is serialized.
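
For reference, the submission pattern can be sketched roughly as below. This is a minimal stand-in under stated assumptions, not our actual code: the `spin` kernel, the `submit` and `run_demo` helpers, the launch counts, and the cycle count are all hypothetical, the "encoder" thread launches the same dummy kernel instead of calling NVENC, and the streams are assumed to be created with `cudaStreamNonBlocking`. If the wall time is close to the larger of the two GPU spans, the streams overlapped; if it is close to their sum, they were serialized.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical stand-in workload: busy-waits for roughly `cycles` GPU clock ticks.
__global__ void spin(long long cycles, long long* out) {
    long long start = clock64();
    long long elapsed = 0;
    while (elapsed < cycles) elapsed = clock64() - start;
    *out = elapsed;  // store the result so the loop is not optimized away
}

// Runs on one CPU thread: creates its own non-blocking stream, submits
// `launches` kernels to it, and measures the GPU time spanned by them.
static void submit(int launches, long long cycles, long long* sink,
                   float* gpu_ms, int* status) {
    *status = 1;
    cudaStream_t stream;
    cudaEvent_t begin, end;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) return;
    cudaEventCreate(&begin);
    cudaEventCreate(&end);
    cudaEventRecord(begin, stream);
    for (int i = 0; i < launches; ++i) spin<<<1, 1, 0, stream>>>(cycles, sink);
    cudaEventRecord(end, stream);
    if (cudaEventSynchronize(end) == cudaSuccess &&
        cudaEventElapsedTime(gpu_ms, begin, end) == cudaSuccess) {
        *status = 0;
    }
    cudaEventDestroy(begin);
    cudaEventDestroy(end);
    cudaStreamDestroy(stream);
}

// Starts both threads and reports per-stream GPU time plus total wall time.
int run_demo(float* c_ms, float* e_ms, float* wall_ms) {
    long long* sinks = nullptr;  // allocated up front; also initializes the context
    if (cudaMalloc(&sinks, 2 * sizeof(long long)) != cudaSuccess) return 1;
    int c_status = 1, e_status = 1;
    const long long cycles = 1000000;  // arbitrary per-kernel spin length
    auto t0 = std::chrono::steady_clock::now();
    std::thread enc(submit, 4, cycles, sinks, e_ms, &e_status);       // "E" thread
    std::thread cuda(submit, 4, cycles, sinks + 1, c_ms, &c_status);  // "C" thread
    enc.join();
    cuda.join();
    auto t1 = std::chrono::steady_clock::now();
    *wall_ms = std::chrono::duration<float, std::milli>(t1 - t0).count();
    cudaFree(sinks);
    return c_status | e_status;
}

int main() {
    float c_ms = 0.0f, e_ms = 0.0f, wall_ms = 0.0f;
    if (run_demo(&c_ms, &e_ms, &wall_ms) != 0) {
        std::printf("CUDA error\n");
        return 1;
    }
    std::printf("C stream: %.3f ms, E stream: %.3f ms, wall: %.3f ms\n",
                c_ms, e_ms, wall_ms);
    return 0;
}
```

The wall time also includes thread and stream creation, so treat the comparison as a rough indicator only. A profiler timeline would be the more reliable way to confirm whether the two streams overlap.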