Question about cuBLAS and optimizing multiple matrix operations

Hi guys, I started looking into CUDA programming two days ago to see if I could improve a particle swarm's performance. While I can see the performance benefits, I'm still finding it hard to understand how to parallelize operations effectively. One example is whether to use cuBLAS or write my own kernel. In all the examples I've seen, cuBLAS functions are called on big matrices that can use all the cores. If I have multiplications between small matrices, can I run cublasSgemm calls concurrently?

I would also like to know how best to convert my matrix calculations into CUDA code.

For example, there are n particles and each has a set of matrices
Pi : {Ai, Bi, Ci, …, Ei}, where Ai, Bi, …, Ei are integer matrices of varying sizes.

A1 of particle 1 has the same dimensions as A2 of particle 2, and the same goes for B1 and B2, etc.

For each particle I need to evaluate an expression something like this:
((Ai.Bi - Ei).Ai) + (Di(transpose).Bi.Ci).Ai + ((Ai - Bi(transpose)).D)

What is the best way to implement this? Should it be a for loop with multiple cuBLAS calls for each part of the formula, like

for (int i = 0; i < n; i++) {
    cublasSgemm(...);  // e.g. Ai.Bi
    cublasSaxpy(...);  // e.g. subtract Ei
    cublasSgemm(...);  // e.g. multiply by Ai
}

Or is there a way to put it all in one function? Also, given that some of the matrix multiplications are pretty small, should I just write my own kernels to ensure all the cores are being used? Should I do operations particle by particle, or operate on multiple particles at once? Thanks for the help.

Environment
Operating System: Windows 10

IDE: Visual Studio 2019

GPU: NVIDIA RTX 2080 Ti

CUDA version: 10.2

There are batched cuBLAS operations that might make sense if you are working with a large number of small matrices. Just search the cuBLAS documentation for the word "batch".
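To make the batched idea concrete, here is a minimal sketch that computes Ai.Bi for every particle in one call to cublasSgemmStridedBatched. It relies on several assumptions that are not in the original post: the integer values are converted to float (exact only while magnitudes stay below 2^24), matrices are stored column-major, and all Ai (and all Bi) are packed back to back in one buffer. The helper name batchedProduct is made up for illustration, and error checking is left out.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Computes C_i = A_i * B_i for all particles with a single cuBLAS call.
// Layout assumption: column-major floats, with particle i's matrices stored
// back to back (A_i at offset i*m*k, B_i at i*k*n, C_i at i*m*n).
// Error checking is omitted to keep the sketch short.
std::vector<float> batchedProduct(const std::vector<float>& hA,
                                  const std::vector<float>& hB,
                                  int m, int n, int k, int particles) {
    std::vector<float> hC(static_cast<size_t>(m) * n * particles);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // A_i is m x k (lda = m), B_i is k x n (ldb = k), C_i is m x n (ldc = m).
    // The stride arguments give the distance between consecutive particles.
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                              &alpha,
                              dA, m, static_cast<long long>(m) * k,
                              dB, k, static_cast<long long>(k) * n,
                              &beta,
                              dC, m, static_cast<long long>(m) * n,
                              particles);

    // cudaMemcpy blocks until the batched GEMM has finished.
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return hC;
}
```

Each matrix type (A, B, C, ...) can have its own dimensions as long as those dimensions match across particles, so every step of the formula would get its own batched call. A transposed operand such as Di(transpose) should not need a separate transpose step: passing CUBLAS_OP_T for that operand asks cuBLAS to read it transposed.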

Thanks for the reply. From what I understand, all the batched cuBLAS operations are for uniform matrices only, right? Like 5 n*m matrices. What about 5 matrices with different numbers of rows and columns?

That’s correct.

If none of your operations are similar across particles then disregard my comment.
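On the custom-kernel side of the question, one possible approach for small integer matrices is to give each particle its own thread block, so that many small products run at once and no conversion to float is needed. The sketch below is only an illustration under assumptions not stated in the thread: row-major int storage, all Ai packed contiguously (likewise Bi), and a fixed block size of 64 threads. The names smallMatMul and multiplyAllParticles are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread block per particle; each thread computes one or more elements
// of C_i = A_i * B_i. Row-major ints: A_i is m x k, B_i is k x n, C_i is m x n.
__global__ void smallMatMul(const int* A, const int* B, int* C,
                            int m, int n, int k) {
    int p = blockIdx.x;  // particle index
    const int* a = A + static_cast<size_t>(p) * m * k;
    const int* b = B + static_cast<size_t>(p) * k * n;
    int* c = C + static_cast<size_t>(p) * m * n;
    // Strided loop keeps the kernel correct when m*n exceeds the block size.
    for (int idx = threadIdx.x; idx < m * n; idx += blockDim.x) {
        int row = idx / n;
        int col = idx % n;
        int acc = 0;
        for (int j = 0; j < k; ++j) {
            acc += a[row * k + j] * b[j * n + col];
        }
        c[idx] = acc;
    }
}

// Host wrapper: copies inputs to the device, launches one block per particle,
// and returns all C_i packed back to back. Error checking is omitted.
std::vector<int> multiplyAllParticles(const std::vector<int>& hA,
                                      const std::vector<int>& hB,
                                      int m, int n, int k, int particles) {
    std::vector<int> hC(static_cast<size_t>(m) * n * particles);
    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(int));
    cudaMalloc(&dB, hB.size() * sizeof(int));
    cudaMalloc(&dC, hC.size() * sizeof(int));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(int), cudaMemcpyHostToDevice);

    smallMatMul<<<particles, 64>>>(dA, dB, dC, m, n, k);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return hC;
}
```

The same one-block-per-particle pattern could in principle be extended to evaluate more of the formula inside a single kernel, which would avoid writing intermediate results back to global memory between steps. Whether that beats a sequence of batched cuBLAS calls depends on the matrix sizes and would need to be measured.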