To make full use of L1, how can I merge kernels with different grid/block sizes?

Hi! I am trying to make full use of L1, for example by keeping more data in registers or shared memory so that it can be reused between kernel 1 and kernel 2. This forces me to merge the two kernels. But how do I merge kernels that have different grid/block sizes?

If the block size is the same, I am considering a producer-consumer structure, something like:
if(gridDim == … )
But the consumer blocks may spin with nothing to do! The number of active blocks is fixed, so the spinning blocks waste occupancy.

What should I do? Any suggestions or papers? Thank you!!

I thought about dynamic parallelism, but shared memory cannot be shared between parent and child kernels.

Someone suggested asynchronous data transfer. Can that really be used here?

How different are the launch configurations of these two kernels? If they are not too dissimilar, I think it is worth trying to launch with the larger launch configuration and then have some of the launched blocks do nothing when performing the work of the smaller kernel. This assumes that this merging makes sense from a performance perspective in the first place, and without more context information that is something I would question.
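A minimal sketch of that idea, under some loud assumptions: the two stages, the names `fused_ab`, `run_fused`, and `kThreads`, and the arithmetic are all hypothetical placeholders, not the kernels from the question. It also assumes that stage B in a given block only needs data that the same block produced in stage A, so `__syncthreads()` is enough and no grid-wide synchronization is needed. The kernel is launched with the larger grid, and blocks that the smaller stage does not need simply return.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // hypothetical block size shared by both stages

// Fused kernel, launched with the grid of the larger stage (stage A).
// Blocks with blockIdx.x >= blocksB have no stage-B work and exit early.
__global__ void fused_ab(const float* in, float* outA, float* outB,
                         int nA, int nB, int blocksB)
{
    __shared__ float tile[kThreads];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage A (placeholder work): every launched block participates.
    const float a = (i < nA) ? 2.0f * in[i] : 0.0f;
    if (i < nA) outA[i] = a;
    tile[threadIdx.x] = a;   // keep the stage-A result on chip
    __syncthreads();         // reached by all threads of every block

    // Stage B: the exit condition is uniform per block, so no thread
    // is left waiting at a barrier.
    if (blockIdx.x >= blocksB) return;

    // Placeholder work that reuses another thread's stage-A value
    // from shared memory instead of re-reading global memory.
    const int j = blockDim.x - 1 - threadIdx.x;
    if (i < nB) outB[i] = tile[j] + 1.0f;
}

// Host helper (hypothetical): runs the fused kernel and returns stage B's output.
std::vector<float> run_fused(int blocksA, int blocksB)
{
    assert(blocksB <= blocksA);  // launch with the larger configuration
    const int nA = blocksA * kThreads;
    const int nB = blocksB * kThreads;

    std::vector<float> h_in(nA), h_outB(nB);
    for (int i = 0; i < nA; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_outA = nullptr, *d_outB = nullptr;
    cudaMalloc(&d_in, nA * sizeof(float));
    cudaMalloc(&d_outA, nA * sizeof(float));
    cudaMalloc(&d_outB, nB * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), nA * sizeof(float), cudaMemcpyHostToDevice);

    fused_ab<<<blocksA, kThreads>>>(d_in, d_outA, d_outB, nA, nB, blocksB);

    // Blocking copy on the default stream; waits for the kernel to finish.
    cudaMemcpy(h_outB.data(), d_outB, nB * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_outA);
    cudaFree(d_outB);
    return h_outB;
}
```

Calling `run_fused(8, 3)` launches 8 blocks, of which 5 idle during stage B. Whether that idle time costs less than the global-memory traffic saved by fusing depends on the real kernels, which is the performance question raised above. If stage B needed data produced by other blocks, this sketch would not be sufficient.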

Hi, njuffa! You have answered tons of my questions over the years! I am very happy to see you!

As for my question, I am considering basic AI building blocks, like GEMM or convolution. Do you think reusing the data in L1 between them could be beneficial here? Obviously, another option is to redesign the kernels so that their launch configurations are equal, but that is not easy, so there is a trade-off.

A typical approach would be to use a grid-stride loop. This allows you to decouple the choice of the grid size from the size of the problem being solved.
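For illustration, here is a minimal grid-stride sketch using SAXPY as stand-in work. The kernel and helper names (`saxpy_grid_stride`, `run_saxpy`) and the input values are assumptions made for this example, not code from the thread. Each thread starts at its global index and advances by the total number of threads in the grid, so any grid size covers all `n` elements.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: correct for any grid size, including grids with
// fewer threads than elements.
__global__ void saxpy_grid_stride(int n, float a, const float* x, float* y)
{
    const int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Host helper (hypothetical): computes y = 2*x + y with x[i] = i and y[i] = 1,
// using whatever launch configuration the caller picks.
std::vector<float> run_saxpy(int n, int blocks, int threads)
{
    std::vector<float> h_x(n), h_y(n, 1.0f);
    for (int i = 0; i < n; ++i) h_x[i] = static_cast<float>(i);

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy_grid_stride<<<blocks, threads>>>(n, 2.0f, d_x, d_y);

    // Blocking copy on the default stream; waits for the kernel to finish.
    cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return h_y;
}
```

`run_saxpy(1000, 2, 64)` and `run_saxpy(1000, 7, 32)` produce the same result even though neither launch has 1000 threads. In a merged kernel, each stage could run its own grid-stride loop over its own problem size under one shared launch configuration.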