Question about cuBLAS and optimizing multiple matrix operations

Hi guys, I just started looking into CUDA programming a couple of days ago to see if I could improve a particle swarm's performance. While I can see the performance benefits, I'm still finding it hard to understand how to parallelize operations effectively. One example is whether to use cuBLAS versus writing my own kernel. In all the examples I've seen, cuBLAS functions are called on big matrices that can use all the cores. If I have multiplications between small matrices, can I run `cublasSgemm` calls concurrently?

I would also like to know how best to convert my matrix calculations into CUDA code.

For example, there are n particles, and each has a set of matrices
Pi : {Ai, Bi, Ci, … Ei}, where Ai, Bi, …, Ei are integer matrices of varying sizes.

A1 of particle 1 has the same dimensions as A2 of particle 2, and likewise for B1 and B2, etc.

For each particle I need to evaluate an expression something like this:
((Ai.Bi − Ei).Ai) + (Di^T.Bi.Ci).Ai + (Ai − Bi^T.D)

What is the best way to implement this? Should it be a for loop with multiple cuBLAS operations for each part of the formula, like

for (int i = 0; i < n; i++) { ... }
or is there a way to put it all in one function? Also, given that some of the matrix multiplications are pretty small, should I just write my own kernels to ensure all the cores are being used? Should I do the operations particle by particle, or operate on multiple particles at once? Thanks for the help.

Operating System: Windows 10

IDE: Visual Studio 2019


CUDA version: 10.2

There are batched cuBLAS operations that might make sense if you are working with a large number of small matrices. Just search the cuBLAS documentation for the word "batch".

Thanks for the reply. From what I understand, all the batched cuBLAS operations are for uniform matrices only, right? Like 5 n×m matrices. What about 5 matrices with different numbers of rows and columns?

That’s correct.

If none of your operations are similar across particles then disregard my comment.