Delayed kernel execution

Hello,

We are tuning our application on a T4 and have run into the problem shown in the following figure:

https://postimg.cc/47XMGRjh

The highlighted volta_hgemm kernel and the kernel before it are on different streams, yet the volta_hgemm kernel does not start executing until the previous kernel finishes. The previous kernel uses only the following resources:

Grid size: [36, 1, 1]
Block size: [256, 1, 1]
Registers/Thread: 51
Shared Memory/Block: 520 B

And here are the resources used by the volta_hgemm kernel:

Grid size: [5, 4, 7]
Block size: [64, 1, 1]
Registers/Thread: 127
Shared Memory/Block: 6.25 KiB

So the previous kernel cannot occupy all of the resources on the T4, and we would expect the volta_hgemm kernel to start running concurrently instead of waiting.
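
As a rough sanity check of that reasoning, here is a back-of-the-envelope sketch of how much of the device each launch needs in aggregate. The per-SM limits are assumptions on our side for the T4 (compute capability 7.5), not values read from the profiler: 40 SMs, 65536 registers, 1024 resident threads, and 64 KiB of shared memory per SM. The helper name `device_fraction` is ours, and the estimate ignores register and shared-memory allocation granularity as well as how blocks are actually distributed across SMs.

```python
# Assumed T4 limits (compute capability 7.5); these are NOT taken from the profiler output.
NUM_SMS = 40
REGS_PER_SM = 65536
THREADS_PER_SM = 1024
SMEM_PER_SM = 64 * 1024  # bytes


def device_fraction(grid, block, regs_per_thread, smem_per_block):
    """Fraction of device-wide threads, registers and shared memory one launch needs.

    Hypothetical helper for illustration; ignores allocation granularity.
    """
    blocks = grid[0] * grid[1] * grid[2]
    threads_per_block = block[0] * block[1] * block[2]
    return {
        "threads": blocks * threads_per_block / (NUM_SMS * THREADS_PER_SM),
        "registers": blocks * threads_per_block * regs_per_thread / (NUM_SMS * REGS_PER_SM),
        "shared_mem": blocks * smem_per_block / (NUM_SMS * SMEM_PER_SM),
    }


# Launch parameters as reported in this post.
previous = device_fraction((36, 1, 1), (256, 1, 1), 51, 520)
hgemm = device_fraction((5, 4, 7), (64, 1, 1), 127, 6400)  # 6.25 KiB = 6400 B
combined = {key: previous[key] + hgemm[key] for key in previous}

for name, usage in (("previous", previous), ("volta_hgemm", hgemm), ("combined", combined)):
    print(name, {key: f"{value:.1%}" for key, value in usage.items()})
```

Under these assumptions, the combined demand of both kernels stays below the device totals for threads, registers, and shared memory. An aggregate estimate like this does not prove that the blocks fit on every individual SM, but it suggests that resource exhaustion alone should not explain the delay.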

Does anyone know what causes the delay and how to fix it?

Thanks in advance.