Decomposing a GEMM and running it in two separate streams

I have two questions about the GEMM.

  1. Is it possible to decompose a GEMM operation into 2 different kernels and run them in 2 parallel streams to get better performance?

  2. What happens to performance if we split the GEMM into a small GEMM (meaning 10% of the rows of the square matrix) and a big GEMM (the remaining rows), as 2 different kernels run in 2 parallel streams? Does it get faster or slower overall?

Thanks

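Dividing up kernel execution work into smaller kernels by itself generally has no performance benefit. A reasonably large GEMM already keeps the whole GPU busy, so two kernels running in parallel streams mostly compete for the same resources.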

However, dividing up a GEMM into streams for the purpose of copy-compute overlap (transferring one chunk of data while computing on another) can have a significant benefit.
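
What makes such a split possible is that each block of rows of C depends only on the matching rows of A and on all of B, so the chunks are independent of each other. Below is a minimal CPU-only sketch of that decomposition, assuming row-major storage. The names gemm_rows, gemm_full, gemm_chunked and split_matches_full are hypothetical and exist only for this illustration; no GPU library is called. On a GPU, each chunk would typically get its own asynchronous copy (for example cudaMemcpyAsync) and its own GEMM call issued into a stream, which is where the overlap comes from.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major GEMM restricted to a block of rows: C[r0:r1, :] = A[r0:r1, :] * B.
// A is n x k, B is k x m, C is n x m. Only rows r0..r1-1 of A and C are touched.
void gemm_rows(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, std::size_t k, std::size_t m,
               std::size_t r0, std::size_t r1) {
    for (std::size_t i = r0; i < r1; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * m + j];
            }
            C[i * m + j] = acc;
        }
    }
}

// The whole GEMM as a single call over all n rows.
std::vector<float> gemm_full(const std::vector<float>& A,
                             const std::vector<float>& B,
                             std::size_t n, std::size_t k, std::size_t m) {
    std::vector<float> C(n * m, 0.0f);
    gemm_rows(A, B, C, k, m, 0, n);
    return C;
}

// The same GEMM issued as independent row chunks. In a GPU pipeline each
// iteration would be one "copy chunk, then compute chunk" step in a stream.
std::vector<float> gemm_chunked(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t n, std::size_t k, std::size_t m,
                                std::size_t chunk_rows) {
    assert(chunk_rows > 0);
    std::vector<float> C(n * m, 0.0f);
    for (std::size_t r0 = 0; r0 < n; r0 += chunk_rows) {
        std::size_t r1 = std::min(n, r0 + chunk_rows);  // last chunk may be short
        gemm_rows(A, B, C, k, m, r0, r1);
    }
    return C;
}

// Builds a deterministic n x n test problem and checks that the chunked
// result is identical to the single-call result.
bool split_matches_full(std::size_t n, std::size_t chunk_rows) {
    std::vector<float> A(n * n), B(n * n);
    for (std::size_t i = 0; i < n * n; ++i) {
        A[i] = static_cast<float>(i % 7) * 0.5f;
        B[i] = static_cast<float>(i % 5) - 2.0f;
    }
    return gemm_full(A, B, n, n, n) == gemm_chunked(A, B, n, n, n, chunk_rows);
}
```

Calling split_matches_full(20, 2) checks the case from the question, where each chunk holds 10% of the rows of a 20 x 20 matrix. The sketch only shows that the split is valid; it says nothing about speed, which on a GPU depends on how well the copies hide behind the compute.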