Multiplication of many small matrices with different numbers of columns

I am currently working on a project that requires acceleration with GPUs. The project requires me to compute many small matrix-matrix multiplications, each of the form C = AB. All A matrices are the same size; the B matrices have the same number of rows (forced by the fixed size of A) but widely varying numbers of columns, from 8 up to hundreds. What should I do to complete this task efficiently on a GPU? My naive idea is to pick a minimal column count for the B matrices: matrices with fewer columns are zero-padded up to it, and wider matrices are split into tiles of that width. Every multiplication is then the same size and can occupy a single block. If anyone has worked on such problems, please let me know your solutions. Any suggestion is appreciated.
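To make the idea concrete, here is a minimal sketch of how the padded/tiled batch might be fed to cuBLAS in a single call. This is only an illustration, not working code from my project; I assume single precision, column-major storage, and that the pointer arrays already live on the device.

```cpp
// Sketch of the padding idea: every B tile is padded or split to a fixed
// width n_tile, so one cublasSgemmBatched call covers all the products.
// d_Aarray/d_Barray/d_Carray are device arrays of device pointers.
// Error checking omitted for brevity.
#include <cublas_v2.h>

void multiply_padded_tiles(cublasHandle_t handle,
                           const float* const* d_Aarray, // A tiles, each m x k
                           const float* const* d_Barray, // B tiles, each k x n_tile (zero-padded)
                           float* const*       d_Carray, // C tiles, each m x n_tile
                           int m, int k, int n_tile, int batchCount)
{
    const float alpha = 1.0f, beta = 0.0f;
    // One call computes C[i] = A[i] * B[i] for the whole batch.
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, n_tile, k,
                       &alpha,
                       d_Aarray, m,
                       d_Barray, k,
                       &beta,
                       d_Carray, m,
                       batchCount);
}
```

The obvious downside is the wasted work on the zero padding, which is exactly why I am asking whether there is a better approach.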

What is the approximate total number of matrix multiplications required?
Roughly how many different sizes need to be handled?
What is the fixed size of A?
What is the approximate range of sizes of B that need to be handled?
Can the matrices B easily be concatenated into an uber-matrix B’, so a single matrix multiplication can deliver the concatenated results AB’?
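(If such concatenation is possible, the whole workload collapses into one large GEMM. A minimal sketch of that case, assuming a single shared A, single precision, and column-major storage; all names are placeholders:)

```cpp
// Sketch of the "uber-matrix" idea: all B matrices concatenated column-wise
// into B' (k x n_total), so a single GEMM yields C' = A * B'.
// Assumes one shared A (m x k), single precision, column-major storage.
#include <cublas_v2.h>

void multiply_concatenated(cublasHandle_t handle,
                           const float* d_A,      // m x k
                           const float* d_Bprime, // k x n_total
                           float*       d_Cprime, // m x n_total
                           int m, int k, int n_total)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n_total, k,
                &alpha, d_A, m, d_Bprime, k,
                &beta, d_Cprime, m);
    // Column j of C' is A times column j of B', so each original result
    // is a contiguous column slice of C'.
}
```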

Thanks for replying. I realize I should provide more details about my scalable-solver project to get more concrete suggestions. Let me try to formalize the problem.

Assume we have a set B of coefficient vectors, each of length p^2 (p is called the truncation number in fast multipole methods). We also have a set A of matrices, each of size p^2 x p^2 (this answers “What is the fixed size of A?”). Each vector b in B is associated with a matrix T in A, and my goal is to compute the product c = T*b for each b. The size of the set A depends on the specific problem, but for simplicity we can assume it is smaller than 1000 (this answers “What is the approximate total number of matrix multiplications required?”). Next I will introduce more details that may be used to improve efficiency.
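As a baseline, this raw per-vector formulation could be expressed as one batched product, with each b treated as a one-column matrix. A rough sketch of what I mean, again assuming cuBLAS, single precision, and column-major storage; repeating T pointers in the batch should be harmless since they are read-only inputs:

```cpp
// Sketch of the raw formulation c = T * b as one batched call: each b is
// treated as a one-column matrix (n = 1), and the T pointer array may
// contain repeated pointers when several vectors share the same matrix.
// All three pointer arrays are device arrays of device pointers; n2 = p*p.
#include <cublas_v2.h>

void multiply_all_vectors(cublasHandle_t handle,
                          const float* const* d_Tarray, // T matrices, each n2 x n2
                          const float* const* d_barray, // b vectors, each of length n2
                          float* const*       d_carray, // c vectors, each of length n2
                          int n2, int batchCount)
{
    const float alpha = 1.0f, beta = 0.0f;
    // GEMM with n = 1 acts as a batched matrix-vector product.
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n2, 1, n2,
                       &alpha,
                       d_Tarray, n2,
                       d_barray, n2,
                       &beta,
                       d_carray, n2,
                       batchCount);
}
```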

First of all, each matrix T in A is sparse and can be decomposed into multiple blocks. Each multiplication c = T*b can then be decomposed into several smaller multiplications c_i = T_i * b_i, and c is reconstructed by stacking the c_i. This should improve performance, because the total amount of arithmetic is reduced: the zero entries outside the blocks are never touched.
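For illustration, if we assume the blocks are diagonal blocks of T, the decomposed product might look like the following sketch (block sizes and offsets are made up; this is not my actual code):

```cpp
// Sketch of exploiting the block structure of a sparse T: assuming the
// blocks are diagonal blocks T_0..T_{nb-1}, c = T * b splits into
// independent products c_i = T_i * b_i over matching slices of b and c.
// d_Tblocks is a host array of device pointers; blockSize lives on the host.
#include <cublas_v2.h>

void multiply_block_diagonal(cublasHandle_t handle,
                             const float* const* d_Tblocks, // T_i, each blockSize[i] x blockSize[i]
                             const int* blockSize, int nb,
                             const float* d_b, float* d_c)
{
    const float alpha = 1.0f, beta = 0.0f;
    int offset = 0;
    for (int i = 0; i < nb; ++i) {
        const int d = blockSize[i];
        // c[offset .. offset+d) = T_i * b[offset .. offset+d)
        cublasSgemv(handle, CUBLAS_OP_N, d, d,
                    &alpha, d_Tblocks[i], d,
                    d_b + offset, 1,
                    &beta, d_c + offset, 1);
        offset += d;
    }
}
```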

Further, different vectors b can be associated with the same matrix T. In general, the number of vectors associated with each matrix is irregular, depending on the mesh layout and space division. Knowing this should help you understand why I formulated the problem as matrix-matrix multiplication in the first place: for each matrix T in A, we can gather all vectors b_0, …, b_(N-1) in B that are multiplied by T and form the matrix S = [b_0, b_1, …, b_(N-1)], each column of which is one vector to be multiplied by T. Computing C = T*S then yields all products associated with T at once. The same procedure applies to every other matrix in A, with S varying from matrix to matrix. Since the number of vectors associated with each T differs, the S matrices are in general of different widths: the number of columns ranges from very small, e.g. 8, to large, e.g. 1200 (this answers “What is the approximate range of sizes of B that need to be handled?”). We thus end up with a set of products C = T*S in which both T and the width of S vary.
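So the computation I ultimately need to accelerate looks roughly like the sketch below: one GEMM per T with a varying column count. The round-robin stream assignment is only an idea I am considering so that the narrow products can overlap; all names are placeholders.

```cpp
// Sketch of the final formulation: one GEMM per matrix T, with S varying
// in width (roughly 8 to 1200 columns). Streams are used so that the
// narrow products can run concurrently rather than serializing the GPU.
// d_T/d_S/d_C are host arrays of device pointers; widths live on the host.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void multiply_all_groups(cublasHandle_t handle,
                         const float* const* d_T,  // T_i, each n2 x n2
                         const float* const* d_S,  // S_i, each n2 x width[i]
                         float* const*       d_C,  // C_i, each n2 x width[i]
                         const int* width, int numT, int n2,
                         cudaStream_t* streams, int numStreams)
{
    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < numT; ++i) {
        // Round-robin over streams so small GEMMs can overlap.
        cublasSetStream(handle, streams[i % numStreams]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n2, width[i], n2,
                    &alpha, d_T[i], n2,
                    d_S[i], n2,
                    &beta, d_C[i], n2);
    }
    cublasSetStream(handle, 0); // restore the default stream
}
```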

One difficulty I find in this discussion is expressing my formulation of the problem clearly and concisely. If anyone is confused by the statement, don’t hesitate to let me know. Thanks.