CUDA high-priority stream is not always scheduled with high priority

In a C++ project, we have a high-priority stream. Most of the time its kernels are scheduled first, but sometimes they are not.

The runtime profile is shown below.
If you cannot open this URL, the attachment is the same picture.
Does anyone have an idea what is going on? Thanks for your answers.

It might be that you have a misplaced cudaStreamSynchronize() that is preventing your higher priority stream kernel from getting scheduled, as your picture seems to be suggesting. There is really no way to confirm this from a static picture of the profiler output.

It doesn’t seem to apply to your case, but you should also be aware that depending on the GPU, blocks from high priority streams must wait for available space on the GPU SMs. If blocks from low priority streams get scheduled first (e.g. because they were launched first), and they fully occupy an SM, then subsequently issued blocks from higher priority streams may have to wait until space (resources) is available on the SM, before they can be scheduled.
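To make the point above concrete, here is a minimal host-only C++ sketch of that behaviour. It is a toy model and not how the real hardware scheduler is implemented: the `Kernel` struct, the `first_block_start` function, and the idea of a fixed number of block "slots" are all hypothetical names and assumptions introduced for illustration, and it makes no CUDA calls. The only thing it models is that priority decides which pending block gets the next free slot, while blocks that are already resident run to completion.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy model, not a CUDA API.
struct Kernel {
    int id;           // label only
    int priority;     // lower value = higher priority (same convention as CUDA streams)
    int launch_time;  // time step at which the kernel is submitted
    int blocks;       // number of blocks in the kernel
    int block_time;   // time steps one block occupies a slot
};

// Returns, per kernel (same order as `kernels`), the time step at which its
// first block starts running, or -1 if it never ran.
// Assumption: running blocks are never preempted; a free slot always goes to
// the highest-priority kernel that has been launched and still has pending
// blocks (ties go to the kernel that appears first in `kernels`).
std::vector<int> first_block_start(const std::vector<Kernel>& kernels, int slots) {
    std::vector<int> start(kernels.size(), -1);
    if (slots <= 0) return start;

    std::vector<int> remaining;
    int left = 0;
    for (const Kernel& k : kernels) {
        remaining.push_back(k.blocks);
        left += k.blocks;
    }

    std::vector<int> busy_until(static_cast<std::size_t>(slots), 0);
    for (int t = 0; left > 0; ++t) {
        for (int s = 0; s < slots; ++s) {
            if (busy_until[s] > t) continue;  // slot still occupied

            // Pick the best ready kernel for this free slot.
            int best = -1;
            for (std::size_t i = 0; i < kernels.size(); ++i) {
                if (remaining[i] == 0 || kernels[i].launch_time > t) continue;
                if (best < 0 || kernels[i].priority < kernels[best].priority) {
                    best = static_cast<int>(i);
                }
            }
            if (best < 0) break;  // nothing is ready at this time step

            busy_until[s] = t + kernels[best].block_time;
            --remaining[best];
            --left;
            if (start[best] < 0) start[best] = t;
        }
    }
    return start;
}
```

For example, with 4 slots, a low-priority kernel `{0, 0, 0, 8, 10}` launched at step 0 and a high-priority kernel `{1, -1, 1, 2, 1}` launched at step 1, the function returns start times 0 and 10: the high-priority kernel has to wait until the first wave of low-priority blocks retires, but it then goes ahead of the remaining low-priority blocks. If the low-priority kernel has only 2 blocks, so that slots are still free, the high-priority kernel starts at step 1.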

Hi, we use TensorRT as our inference framework, and we call enqueue() from multiple host threads.
When we create a shuffle layer as below,
nvinfer1::IShuffleLayer *shuffleLayer = net->addShuffle(*inputs[0]);

We found that the kernels launched by enqueue() from multiple threads, each using its own stream, do not execute in parallel on the GPU.
Our question is: does the TensorRT shuffle layer prevent kernels in other streams from being scheduled?