How is the number of CUDA contexts related to the number of NVDEC engines?

The Tesla T4 has 2 decode engines.
An experiment shows that a single CUDA context used for decoding cannot
maximize decoding throughput, while 2 CUDA contexts almost fully utilize the decode engines.
Why is it like this?

It’s not constrained by having 1 or 2 CUDA contexts (there really isn’t anything except the true primary context anyway), but rather by having 1 or 2 video streams being decoded. There is a sequential dependency between the frames of a stream, so a single stream cannot keep more than one decode engine busy at a time. You need at least as many decoder instances as there are decode engines in order to have enough independent work available.
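To make the argument concrete, here is a toy scheduling model in Python. It is only a sketch and does not call NVDEC or CUDA at all. It assumes every frame takes one time unit to decode and that each frame depends on the previous frame of the same stream, which is a simplification of real codecs. The function name `engine_utilization` and its parameters are made up for illustration.

```python
def engine_utilization(num_streams, frames_per_stream, num_engines):
    """Toy model: fraction of engine time spent decoding.

    Assumptions (not taken from any NVIDIA API): each frame costs one
    tick, and a stream can have at most one frame in flight because each
    frame depends on the previous one in the same stream.
    """
    remaining = [frames_per_stream] * num_streams
    ticks = 0
    while any(remaining):
        # Streams that still have frames left, longest backlog first.
        ready = sorted(
            (i for i, r in enumerate(remaining) if r > 0),
            key=lambda i: -remaining[i],
        )
        # Each engine takes one frame from a distinct stream this tick.
        for i in ready[:num_engines]:
            remaining[i] -= 1
        ticks += 1
    if ticks == 0:
        return 0.0  # nothing to decode
    total_frames = num_streams * frames_per_stream
    return total_frames / (ticks * num_engines)


if __name__ == "__main__":
    for streams in (1, 2, 4):
        print(streams, "stream(s):", engine_utilization(streams, 100, 2))
```

In this model one stream on two engines gives 50% utilization, and two or more streams give 100%, regardless of how many CUDA contexts would be involved. Real numbers will differ, but the shape of the result matches the experiment described in the question.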