We are tuning our application on a T4 and have run into the problem shown in the following figure:
The highlighted volta_hgemm kernel and the kernel before it are launched on different streams, yet the volta_hgemm kernel does not start executing until the previous kernel has finished. The previous kernel uses only the following resources:
Grid size: [36, 1, 1]
Block size: [256, 1, 1]
Shared Memory/Block: 520 B
And here are the resources used by the volta_hgemm kernel:
Grid size: [5, 4, 7]
Block size: [64, 1, 1]
Shared Memory/Block: 6.25 KiB
So the previous kernel cannot occupy all of the resources on the T4, and we would expect the volta_hgemm kernel to start sooner and overlap with it.
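For reference, here is a rough back-of-the-envelope check of that claim, using the launch configurations listed above. The T4 limits in it (40 SMs, 1024 resident threads per SM, 16 resident blocks per SM, 64 KiB of shared memory per SM) are my assumptions taken from the compute capability 7.5 specifications, not values reported by the profiler, and the helper name `footprint` is made up for this sketch. Register usage is not shown above, so it is left out, and the comparison uses device-wide totals rather than per-SM placement, so it is only a coarse check.

```python
from math import prod

# Assumed T4 (compute capability 7.5) limits; not reported in the post.
T4_SM_COUNT = 40
T4_MAX_THREADS_PER_SM = 1024
T4_MAX_BLOCKS_PER_SM = 16
T4_MAX_SHARED_PER_SM = 64 * 1024  # bytes


def footprint(grid, block, shared_per_block):
    """Return (blocks, threads, shared memory bytes) for a whole launch."""
    blocks = prod(grid)
    threads = blocks * prod(block)
    shared = blocks * shared_per_block
    return blocks, threads, shared


# Launch configurations as listed in the post.
prev = footprint((36, 1, 1), (256, 1, 1), 520)
hgemm = footprint((5, 4, 7), (64, 1, 1), int(6.25 * 1024))

# Device-wide capacity under the assumed limits.
capacity = (
    T4_SM_COUNT * T4_MAX_BLOCKS_PER_SM,
    T4_SM_COUNT * T4_MAX_THREADS_PER_SM,
    T4_SM_COUNT * T4_MAX_SHARED_PER_SM,
)

# Sum the two kernels' footprints and compare against capacity.
combined = tuple(p + h for p, h in zip(prev, hgemm))
fits = all(used <= limit for used, limit in zip(combined, capacity))

print("previous kernel:", prev)
print("volta_hgemm:    ", hgemm)
print("combined:       ", combined)
print("capacity:       ", capacity)
print("both fit at once (ignoring registers):", fits)
```

Under these assumptions both kernels together need well under half of the block slots, threads, and shared memory on the device, so block, thread, and shared memory limits alone do not seem to explain the serialization.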
Does anyone know what causes the delay and how to fix it?
Thanks in advance.