I would like to process a batch of small matrices within CUDA. So far I have used the gemmStridedBatched routines from cuBLAS. Now I have an array containing a batch of many “concatenated” small matrices in column-major format, as is typical in cuBLAS. I would like to compute the matrix product of all the matrices in this one array, i.e. reduce all M N×N matrices to a single N×N matrix by calculating A1 * A2 * … * AM. Is there a way to do something like that in cuBLAS? I also looked at cuBLASLt, but I couldn’t see how to exploit its operation-descriptor formalism to achieve this reduction. Do you have any hint for me?
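One approach I have been considering (a sketch, not a confirmed cuBLAS feature): since matrix multiplication is associative, the M matrices can be reduced pairwise in a tree, halving the batch on each pass, so each pass is a single gemmStridedBatched call with strideA = strideB = 2*N*N (B offset by N*N) and strideC = N*N. Below is a minimal CPU sketch of that scheme, with a naive column-major `gemm` loop standing in for the cuBLAS call; `reduce_product` is a hypothetical name:

```cpp
#include <algorithm>
#include <vector>

// Multiply two NxN column-major matrices: C = A * B.
// On the GPU, one reduction pass would instead be a single
// cublas<T>gemmStridedBatched call over all pairs at once.
static void gemm(const float* A, const float* B, float* C, int n) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            float s = 0.0f;
            for (int k = 0; k < n; ++k)
                s += A[k * n + i] * B[j * n + k]; // column-major indexing
            C[j * n + i] = s;
        }
}

// Reduce m concatenated NxN matrices to A1*A2*...*Am by pairwise
// (tree) reduction. Matrix multiplication is associative, so pairing
// neighbors preserves the overall (non-commutative) product order.
std::vector<float> reduce_product(std::vector<float> batch, int n, int m) {
    std::vector<float> out(((m + 1) / 2) * n * n);
    while (m > 1) {
        int pairs = m / 2;
        // Equivalent strided-batched GEMM: A at offset 0, B at offset
        // n*n, strideA = strideB = 2*n*n, strideC = n*n, batchCount = pairs.
        for (int p = 0; p < pairs; ++p)
            gemm(&batch[2 * p * n * n], &batch[(2 * p + 1) * n * n],
                 &out[p * n * n], n);
        if (m % 2) // odd leftover matrix is carried to the next pass
            std::copy(batch.end() - n * n, batch.end(),
                      out.begin() + pairs * n * n);
        m = pairs + (m % 2);
        batch.assign(out.begin(), out.begin() + m * n * n);
    }
    batch.resize(n * n);
    return batch;
}
```

This needs ceil(log2(M)) passes instead of M-1 sequential multiplies, at the cost of a second buffer; whether that beats a simple sequential loop of single GEMMs presumably depends on N and M.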