I have two questions about GEMM.
- Is it possible to decompose a GEMM operation into 2 separate kernels and run them concurrently in 2 streams to get better performance?
- What happens to performance if we split the GEMM into a small GEMM (covering 10% of the rows of the square matrix) and a big GEMM (the remaining rows), launched as 2 separate kernels in 2 parallel streams? Does it get faster or slower overall?
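To make the split concrete, here is a minimal sketch of the decomposition I have in mind. It assumes the split is along the rows of the left matrix A, which gives two independent row blocks of C. Plain Python lists stand in for the device matrices, and the names `gemm` and `split_gemm` are made up for illustration. It only shows that the two partial products are independent and together reproduce the full result. It does not involve a GPU or streams and says nothing about performance.

```python
# Sketch only: plain Python lists stand in for device matrices, and the two
# gemm() calls stand in for the two kernels (no GPU or streams involved).

def gemm(a, b):
    """Return the matrix product of a and b (row-major lists of lists)."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_gemm(a, b, small_fraction=0.1):
    """Compute a*b as a small GEMM on the first rows of a plus a big GEMM on the rest."""
    # Assumption: "10% of rows" means the first 10% of the rows of A.
    n_small = max(1, int(len(a) * small_fraction))
    c_small = gemm(a[:n_small], b)   # would be kernel 1 in stream 1
    c_big = gemm(a[n_small:], b)     # would be kernel 2 in stream 2
    # The two results are disjoint row blocks of C, so they just stack.
    return c_small + c_big


n = 20
a = [[(i + 2 * j) % 7 for j in range(n)] for i in range(n)]
b = [[(3 * i + j) % 5 for j in range(n)] for i in range(n)]
assert split_gemm(a, b) == gemm(a, b)
```

Since neither partial product reads the other's output, the two launches need no synchronization between them. My question is whether running them side by side actually helps.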
Thanks