Matrix multiplication of many small-sized matrices

I am new to the CUDA programming world and would like to start with the following problem: I would like to multiply a huge batch of matrices together. All matrices origin from two “base” matrices A and B. In addition, I have an array v. I would like to calculate for each value in v the matrix C = A + v[i]*B, then apply a matrix function to the resulting matrix, thus obtain D = func© (also on the GPU) and finally I would like to obtain the matrix product of all matrices D, thus D1 * D2 * D3 etc. The matrices A and B are of the order of 200 x 200, but the array v can be long.

Since I am new to the CUDA world, I would like to ask how to approach such a problem. I have taken a look at cuBLAS and the gemmBatched functions, but they are optimized matrix multiplications for two arrays each containing matrices. I would like to take this problem to learn more about doing linear algebra with CUDA. I am grateful for any hints how to start…


I think you may need to have to experiment with this. I did some of the early exploratory work for batched operations in CUBLAS, and as I recall batching is a great approach for matrices that are small, say smaller than 50x50 or 30x30 depending on element type. On the other hand, non-batched CUBLAS APIs usually showed good performance for matrices larger than 256x256. Problematic from a performance perspective were the medium-sized matrices like in your use case. Too big to be handled efficiently by one thread block, too small to be efficiently handled by an entire grid.

Now, this work happened years ago, and my observations may no longer apply. Another big unknown here is the “matrix function”. Is that a cheap or a costly operation? We don’t know.

The way I would approach this is to first build a naive version from known building CUBLAS blocks. The initial step in the process looks like *AXPY to me, then we have the application of the matrix function, then a whole bunch of calls to *GEMM.

At the end of this we would have working code and a working test framework, and can examine the performance characteristics of this naive solution with the profiler. This will all be good practice for learning CUDA and CUBLAS, and provide a lower performance bound for subsequent improved versions.

In terms of a custom solution not relying on libraries, you might want to think how the application of the matrix function could possibly be combined with some of the other steps, and look at some tiled matrix operations (with use of shared memory for caching of tiles) in the CUDA sample apps. The general idea is to minimize data movement.

So concerning the matrix function, this will be a matrix exponential based on the scaling-and-squaring algorithm and a Pade approximation. Sorry that I omitted that.

However, I feel a little lost in the overall design idea. I can’t really compute expm(A+v[i]*B) for all values in v (this can be between 1e5 elements) and store the matrices before doing the matrix multiplication reduction. This would simply use too much memory. On a CPU, I would implement a recursive algorithm that divides the array further and further until only one element is left. Then I would calculate the matrix exponential and the return values are reduced by the matrix multiplication in the “next higher level” of the recursion function. However, it seems to me that this approach is not very efficient on a GPU since stack size is limited. In addition, the shared memory is too small to handle on matrix per block as you said. So I am a little unsure how to start…


I have no practical experience with matrix exponentials, and am familiar with Pade approximations only in the context of scalar functions., so can’t offer any wisdom here. It now seems that the matrix function is actually the most computationally intensive part of the overall computation. Being a performance-oriented kind of guy, I find that recursion is hardly ever the right answer to anything. Even more so in the context of GPUs.

I assume you have done a thorough literature search to find out how other folks have gone about this (or similiar) issues? If this were de-novo research, and I would be assigned to the task, I would start by first looking into scaled-down simplified versions of the computation to gain familiarity with the programming environment and the basic properties of the computation, then use any knowledge gained to scale up to the full problem step by step: Crawl before you walk, walk before you run. That would obviously take a few months, which (to me) is the nature of research.