 # Fastest factorization algorithm for banded matrices

Hi everyone,

I am looking for a matrix factorization algorithm for banded matrices that is also efficient to implement in CUDA. I’ll be using this to solve linear equations. The matrices I’ll be working with are about 6000x6000 elements with a bandwidth of about 60. Looking at vvolkov’s work, QR factorization is the most efficient factorization in terms of flops for dense matrices. Since my matrices are symmetric positive definite, I can also use the Cholesky decomposition, or solve the system of linear equations with the conjugate gradient method. I’d appreciate any suggestions for which method is fastest on a GPU.
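For scale, here are my own back-of-envelope operation counts for these sizes, using the standard flop estimates (not measured numbers); they show why exploiting the band matters far more than the choice between dense factorizations:

```python
# Rough flop estimates for the sizes above (my own back-of-envelope numbers,
# using standard operation counts; not benchmarks).
n, k = 6000, 30            # matrix order and half-bandwidth (total bandwidth ~60)

dense_cholesky = n**3 / 3          # ~n^3/3 flops for a dense Cholesky factorization
band_cholesky = n * k**2           # ~n*k^2 flops when only the band is factored
band_matvec = 2 * n * (2 * k + 1)  # one banded mat-vec, e.g. per CG iteration

print(f"dense Cholesky: {dense_cholesky:.1e} flops")
print(f"banded Cholesky: {band_cholesky:.1e} flops")
print(f"banded mat-vec: {band_matvec:.1e} flops")
```

So a banded Cholesky is roughly four orders of magnitude cheaper than the dense one, and CG only wins if it converges in well under (band Cholesky flops / mat-vec flops) iterations.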

If anyone can suggest libraries that supply this functionality, I’d greatly appreciate it.

Is that the same as Sparse Matrix Factorization?

http://glaros.dtc.umn.edu/gkhome/node/68

My best shot… ;)

Thanks for the link. Banded matrices are a type of sparse matrix in which only the main diagonal and a few adjacent diagonals hold nonzero values. The paper you linked to is probably more general than I need, but it’s still a starting point, so I’ll see what I can make of it.
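To make the structure concrete, here is a minimal pure-Python sketch (names and storage layout are my own) of compact band storage and a symmetric banded matrix-vector product, which is the kernel an iterative method would need:

```python
# Sketch: compact storage for a symmetric banded matrix (hypothetical layout).
# Only the main diagonal and the k super-diagonals are stored; symmetry
# supplies the lower triangle. A 6000x6000 matrix with half-bandwidth 30
# then needs 6000*31 entries instead of 36 million.

def band_matvec(diags, k, x):
    """y = A @ x for symmetric banded A, where diags[d][i] holds A[i, i+d]."""
    n = len(x)
    y = [diags[0][i] * x[i] for i in range(n)]  # main diagonal contribution
    for d in range(1, k + 1):
        for i in range(n - d):
            y[i] += diags[d][i] * x[i + d]      # super-diagonal term
            y[i + d] += diags[d][i] * x[i]      # mirrored sub-diagonal term
    return y

# Tridiagonal example: A = [[2,1,0],[1,2,1],[0,1,2]]
diags = [[2, 2, 2], [1, 1]]
print(band_matvec(diags, 1, [1, 1, 1]))  # [3, 4, 3]
```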

I’ve also been searching for anything that compares the various methods (Cholesky, QR, conjugate gradient, etc.) so that I can see which one is fastest for banded matrices. On a hunch, I’d guess the answer is Cholesky factorization, simply because that’s the one mentioned most often in the literature I’ve seen. However, I would like to hear someone’s experience here, since I haven’t yet seen a paper comparing the various methods with regard to the time taken to do the factorization or solve the linear system of equations.

Thanks for the help.

I’m currently solving a problem where I’m doing hundreds of small QR decompositions in parallel and then finally solving my systems of equations with a backsolve.

But my problem doesn’t seem to match yours, since I’m working on many much smaller matrices.

Since you are working with sparse matrices, is the conjugate gradient method perhaps the way to go? I’m not sure how parallelizable it is, though…

Looking through a book I have that describes the conjugate gradient method, it seems simple enough, so I implemented an unoptimized version today. I still need to find a good reference that describes how to work out the preconditioner matrix. This method looks like something I can use, but only if it can solve a problem in a small number of iterations. Since CUBLAS has a function for multiplying a banded matrix by a vector, it should be easy to optimize for my banded matrices.
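For reference, an unpreconditioned CG iteration is short; here is a minimal pure-Python sketch (dense mat-vec for clarity; on the GPU that line would be the banded CUBLAS call):

```python
# Minimal unpreconditioned conjugate gradient sketch for SPD A.
# Pure Python with a dense mat-vec for readability; the mat-vec is the
# only expensive step and is what the banded CUBLAS routine would replace.

def cg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x starts at 0)
    p = r[:]                      # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # converged: squared residual small enough
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
x = cg(A, [1.0, 2.0])
print(x)  # close to the exact solution [1/11, 7/11]
```

In exact arithmetic CG converges in at most n steps, but in practice you stop far earlier; the whole point of a preconditioner is to make that happen in few iterations.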

Still, I would like to know how the various methods compare, without having to implement and test all of them.

Your matrix is banded, but is it SPD? That can have a big bearing on which method to use. In my (not exhaustive) experience with linear problems and SPD matrices, the preconditioned CG method is probably more efficient than a direct factorisation, provided you have a suitable preconditioner that is easy to implement in CUDA and lets you keep the data on the GPU for a complete CG step at a time. That can be a pretty big if, depending on what sort of problems you are working on.

Yes, it is SPD. It comes from a quadratic programming minimization problem where it has to be SPD. The preconditioner is a pretty big if. Right now, my implementation doesn’t converge on the test matrix I am using. I’ve tried the preconditioning matrix from the Symmetric Successive Over-Relaxation (SSOR) method, but that isn’t working yet either.
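Before debugging SSOR, it may be worth sanity-checking with the simplest possible preconditioner, the diagonal (Jacobi) one, M = diag(A). A sketch under that assumption (pure Python, dense mat-vec for clarity; all names are my own):

```python
# Sketch: preconditioned CG with the Jacobi preconditioner M = diag(A).
# Applying M^{-1} is just an elementwise scale, so it is trivially parallel
# and a useful baseline before trying SSOR. Not a tuned implementation.

def pcg_jacobi(A, b, tol=1e-12, max_iter=200):
    n = len(b)
    inv_d = [1.0 / A[i][i] for i in range(n)]   # M^{-1} for M = diag(A)
    x = [0.0] * n
    r = b[:]
    z = [inv_d[i] * r[i] for i in range(n)]     # z = M^{-1} r
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) < tol:
            break
        z = [inv_d[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

sol = pcg_jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(sol)  # close to the exact solution [1/11, 7/11]
```

If even unpreconditioned CG diverges on an SPD matrix, that usually points to a bug in the iteration or the mat-vec rather than the preconditioner, so this kind of baseline can help isolate the problem.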

How much more efficient is the CG method?

Thanks for the help.