cuBLAS GemmStridedBatched: Algorithm selection

Hi,
I am currently implementing batched processing of 80k single-precision complex matrices of various sizes. I observe very good performance with 12x12 matrices; however, for 24x24 matrices NVVP shows that GPU utilization drops to about one third. By switching to the GemmStridedBatchedEx function and selecting CUBLAS_GEMM_ALGO22, I was able to get better performance (although NVVP complains about non-ideal shared memory access patterns).
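For reference, my call looks roughly like this (a minimal sketch; gemm_batched_algo is just my name for the wrapper, and on toolkits before CUDA 11 the compute type is passed as CUDA_C_32F instead of CUBLAS_COMPUTE_32F):

```c
#include <cublas_v2.h>
#include <cuComplex.h>

// d_A, d_B, d_C: device buffers holding `batch` n-by-n matrices
// stored back-to-back, so the stride between matrices is n*n elements.
cublasStatus_t gemm_batched_algo(cublasHandle_t handle,
                                 const cuComplex *d_A,
                                 const cuComplex *d_B,
                                 cuComplex       *d_C,
                                 int n, int batch,
                                 cublasGemmAlgo_t algo)
{
    const cuComplex alpha  = make_cuComplex(1.0f, 0.0f);
    const cuComplex beta   = make_cuComplex(0.0f, 0.0f);
    const long long stride = (long long)n * n;

    return cublasGemmStridedBatchedEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N,
        n, n, n,
        &alpha,
        d_A, CUDA_C_32F, n, stride,
        d_B, CUDA_C_32F, n, stride,
        &beta,
        d_C, CUDA_C_32F, n, stride,
        batch,
        CUBLAS_COMPUTE_32F,   /* CUDA_C_32F on pre-CUDA-11 toolkits */
        algo);                /* e.g. CUBLAS_GEMM_ALGO22 */
}
```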

Could NVIDIA provide an overview of the differences between algorithms 1-23? In addition, how does the plain GemmStridedBatched function select an algorithm internally? At what matrix sizes does it switch implementations? The documentation doesn't provide any information on this. In the meantime, I'm finding the fastest algorithm empirically with the sweep sketched below.
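This is roughly how I timed the individual algorithms to arrive at ALGO22 (a minimal sketch using the gemm_batched_algo wrapper from above; algorithms the library rejects for this problem size are simply skipped):

```c
#include <cuda_runtime.h>
#include <stdio.h>

void sweep_algorithms(cublasHandle_t handle,
                      const cuComplex *d_A, const cuComplex *d_B,
                      cuComplex *d_C, int n, int batch)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int a = 0; a <= 23; ++a) {
        cublasGemmAlgo_t algo = (cublasGemmAlgo_t)a;

        /* warm-up call; also filters out unsupported algorithms */
        if (gemm_batched_algo(handle, d_A, d_B, d_C, n, batch, algo)
                != CUBLAS_STATUS_SUCCESS)
            continue;

        cudaEventRecord(start);
        gemm_batched_algo(handle, d_A, d_B, d_C, n, batch, algo);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("CUBLAS_GEMM_ALGO%-2d: %.3f ms\n", a, ms);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```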

Peter