Numerous but small-sized matrix inversions - looking for advice on how to speed up this problem

Hi there,

after spending a lot of time reading through CUDA tutorials, trying simple examples, etc., I'm finally ready to start speeding up an existing piece of source code on my own.

The basic structure is as follows:

for (a = 1..10)
   // do sthg
   for (b = 1..100)
      // do sthg
      for (c = 1..100)
         // do sthg
         multMatMat(A, B)    // BLAS3
         invMat(A)           // most computationally intensive task
         multMatVec(A, b)    // BLAS2
         multVecVec(a, b)    // BLAS1

The matrices are 80 x 80, and the vectors involved are 80 x 1. The elements themselves are complex floats. My initial plan was to use CUBLAS.

Now, from reading through many of your posts I conclude that 80 x 80 is considered a rather small matrix in the GPU world. Still, despite these small sizes there should be a lot of speed-up potential, since the for-loops are executed very often (let's say about 100,000 times for a start, but it could be millions and more).

Of course, my concern is that there are some other computations in between these loops (not only matrix operations), which might impede developing a fully parallel solution. However, the results of one loop iteration do NOT depend on any other iteration. Therefore, there should be a lot of potential, no?

So my real questions are:

  • Can somebody suggest a way to speed up the basic problem stated above?

  • I have tried my luck with CUBLAS so far; what I am afraid of now is that CUBLAS is probably not suited to such small matrices. Do I need to do it in “native CUDA” (specifying the grids/blocks and implementing matrix multiplication/inversion myself), and would this bring a benefit?

  • Is somebody aware of a way to do matrix inversion (or at least LU decomposition) in CUDA? From what I see, CUBLAS cannot solve this issue by default. I have read some threads about it in this forum, but none of them seem to have solved matrix inversion/LU decomposition just by using CUBLAS (or CUDA).

Any hints are appreciated!



That reasoning is not quite right. CUDA has a startup overhead of at least 20 µs, so you cannot expect any speedup for operations that take less than about 40 µs on the CPU. And even that assumes that all data is already on the GPU and will stay on the GPU.

Since an 80x80 matrix fits in the cache, I would guess that e.g. the MatVec will take about 9 µs (assuming a, by today's standards, rather slow 2 GHz CPU using only one core), and the scalar product even less. You might get lucky and be able to make up for the speed loss there in the inversion and the matrix-matrix multiplication. IMHO, expecting even a 4x speedup would be very optimistic, unless you can parallelize this more than it currently looks (doing multiple matrix-matrix multiplications at the same time or so).

If you manage to prove my pessimism wrong I’ll be happy to hear about it though :D

Hey. If it’s true that the iterations of the “c” for loop have nothing to do with each other, then there might be good speedup potential. About how long does it take your computer to execute all 100 iterations of the c loop? If it takes even a second, chances are the GPU will speed up your application a lot if it’s coded well.

I don’t know of any available LU factorization implementations, but yesterday I started working on a Cholesky factorization code, which is very similar to LU factorization. I’ll probably post my code in about a week after some testing; it can probably be easily modified to suit your needs. But it might be faster for you to code your own implementation of LU if you don’t want to wait, or to modify someone else’s code to suit your needs ;)

Thanks for your responses so far.

Actually, what I meant was that I could execute some of the for-loops in parallel (e.g. the inner for loop, with about 100 - 5000 iterations), because the data sets are independent of each other (the data differs in each iteration, but the operations are the same). That should provide a lot of speedup potential, right?

My question then is whether there is a way to do such things in CUBLAS (i.e. execute multiple matrix multiplications in parallel), or - more likely - whether I have to use native CUDA to divide my data into grids/blocks and apply my own implementation of the matrix multiplication.

@ColinS: sure, I would definitely be interested in taking a look at your Cholesky factorization code, and in hearing how much speedup you could observe.

by Michael

Well, the answer to that is yes and no. One thing you could do is create a large sparse matrix and fill the blocks along its diagonal with your smaller matrices. In your case, you could probably load one or two hundred of your smaller matrices onto the diagonal of the larger matrix without running out of memory on your graphics card. Then, when you take the inverse of the larger block-diagonal matrix, you actually end up with the inverses of all your smaller matrices.

Here’s the catch: I don’t think CUBLAS supports sparse matrices, which means it will perform the operation on the entire matrix. That could potentially make the calculation even slower than the CPU version, because so much more work is being done. I’m not too familiar with CUBLAS, but I believe this is the case. There may be an optimal batch size at which you would still achieve a speedup even though more calculations are being done. Unfortunately, I think if you want the absolute best speedup you could possibly get, you might have to write a CUDA kernel yourself.

Even though you have thousands of small matrices, 80 x 80 is still pretty small. If all the iterations of the inner loop can be done on a CPU in less than a second, then implementing it in CUDA may not bring such a great speedup. You should definitely time how long it takes your CPU to calculate all the iterations of the inner for loop ;)

I’ll try to have some working code for Cholesky factorization by Monday. Some other things have come up in my project, and the factorization isn’t the highest priority right now :P