Help with GPU Cholesky factorization


I have a computing problem: I need to compute several Cholesky factorizations at the same time. That is, I want to compute hundreds of different Cholesky factorizations on the GPU at the same time (with small matrices).

As far as I know, there are options to compute Cholesky on the GPU (maybe CUBLAS), but these options are developed with the aim of computing one Cholesky factorization of a big matrix using the whole GPU.
These options are not the most appropriate for me.

Any suggestions?

Thanks in advance.

I’m not sure whether CUBLAS has any built-in factorization algorithms, but V. Volkov published some work on it. I’m only a beginner, so I can’t be of much help, and the only thing I can say right now is that you might want to check Volkov’s papers. Perhaps you’ve seen them already, though.

If I understand correctly, you are looking for a batched Cholesky factorization of many small matrices. I am not aware of ready-to-use code for this at this time. While CUBLAS offers some batched operations, Cholesky factorization is not among them. On the registered developer website you can find code that performs batched solves and matrix inversions on small matrices, speeding these up by performing all operations in shared memory (which in turn limits the size of matrices that can be handled). You could use that as a template for the implementation of a batched Cholesky factorization; the code is under a BSD license.
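To make "batched Cholesky" concrete, here is a minimal CPU reference sketch in plain C++. It is not the CUDA Batch Solver code and not a CUBLAS API; the names `cholesky_in_place` and `batched_cholesky` and the back-to-back row-major storage are assumptions made for illustration. The point is only that each matrix in the batch is factored independently, so on a GPU each loop iteration of the driver could be assigned to its own thread or thread block.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Factor one n x n symmetric positive definite matrix in place (row-major).
// Only the lower triangle is read. On return the lower triangle holds L with
// A = L * L^T; the strict upper triangle is left untouched.
// Returns false if a non-positive pivot shows the matrix is not SPD.
bool cholesky_in_place(double* a, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        double d = a[j * n + j];
        for (std::size_t k = 0; k < j; ++k) d -= a[j * n + k] * a[j * n + k];
        if (d <= 0.0) return false;
        d = std::sqrt(d);
        a[j * n + j] = d;
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i * n + j];
            for (std::size_t k = 0; k < j; ++k) s -= a[i * n + k] * a[j * n + k];
            a[i * n + j] = s / d;
        }
    }
    return true;
}

// Hypothetical batched driver: `batch` holds `count` matrices of identical
// size n x n stored back to back. The iterations do not depend on each other.
// Returns the number of matrices that were factored successfully.
std::size_t batched_cholesky(std::vector<double>& batch, std::size_t n,
                             std::size_t count) {
    assert(batch.size() == n * n * count);
    std::size_t ok = 0;
    for (std::size_t m = 0; m < count; ++m) {
        if (cholesky_in_place(&batch[m * n * n], n)) ++ok;
    }
    return ok;
}
```

For example, a batch of the two 2x2 matrices [4 2; 2 3] and [9 3; 3 5] can be passed as one vector of eight values with n = 2 and count = 2.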

Hi, thanks for the responses.

@Momonga, I read the Volkov papers and research. It is very good work, but it is developed with the aim of taking advantage of the entire GPU to compute one Cholesky factorization (using CUBLAS).

@Njuffa, thanks for your response. Do you have any link or information about batched operations? I cannot find anything.
I’m not sure, but I think it is not an option for me.

I need to compute several (500) different Cholesky factorizations (for small and different matrices) at the same time, as fast as possible.

Thanks a lot!

Batched operations are designed for the efficient handling of multiple small matrices. A single small matrix only uses a fraction of the available computational capability of the GPU. By working on multiple matrices at once, one can utilize all of the computational capability. Batched operations work best when there are thousands of small matrices, but still perform much better than dealing with matrices individually when there are 500 matrices. As far as I am aware, existing batched codes require all matrices in the batch to have the same size (this is not a restriction for typical real-world scenarios).

For the batched operations supported by CUBLAS, please consult the CUBLAS documentation (it comes with your CUDA installation). For the source code of the batched dense solver and matrix inversion, please log into the registered developer website and download it from there. To log in (or register as a developer), please go to

Look for “CUDA Batch Solver” among the available downloads.

Thanks for the responses and for the links.

I see I’m not the only person with this problem. I’m going to develop a “batched Cholesky”.

I was looking into small Cholesky factorizations, like 32x32, but I decided to ignore the symmetry and go for full Gaussian elimination instead: it is hard to skip updating the upper half of a small matrix when computing in SIMD. You can check some sample code at:

Look at page 23 onward. It may not be optimal; possibly some other data layout would work better, but I think better code would look conceptually similar.
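For readers without access to that sample, the "ignore the symmetry" idea can be sketched as follows. This is a plain C++ CPU illustration written for this thread, not the code from the slides; the function names `lu_no_pivot` and `upper_to_cholesky` are hypothetical. It runs Gaussian elimination without pivoting over the whole trailing block, so every row does the same work and no thread would need to be masked off for the upper half. For a symmetric positive definite matrix the resulting U satisfies U = D L^T, so scaling row k of U by 1/sqrt(U[k][k]) recovers an upper Cholesky factor R with A = R^T R.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian elimination without pivoting on a row-major n x n matrix.
// The whole trailing block is updated, with no attempt to exploit symmetry.
// On return the upper triangle (with diagonal) holds U and the strict lower
// triangle holds the multipliers of the unit lower factor.
// Returns false on a non-positive pivot (input was not SPD).
bool lu_no_pivot(std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    for (std::size_t k = 0; k < n; ++k) {
        const double pivot = a[k * n + k];
        if (pivot <= 0.0) return false;
        for (std::size_t i = k + 1; i < n; ++i) {
            const double m = a[i * n + k] / pivot;
            a[i * n + k] = m;
            for (std::size_t j = k + 1; j < n; ++j) {
                a[i * n + j] -= m * a[k * n + j];
            }
        }
    }
    return true;
}

// Turn the U produced above into an upper Cholesky factor R (A = R^T R)
// by dividing each row by the square root of its diagonal entry.
// The strict lower triangle is not modified.
void upper_to_cholesky(std::vector<double>& a, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        const double s = 1.0 / std::sqrt(a[k * n + k]);
        for (std::size_t j = k; j < n; ++j) a[k * n + j] *= s;
    }
}
```

The trade-off in this sketch is roughly twice the arithmetic of a true Cholesky factorization in exchange for uniform work per row; whether that pays off on a given GPU is something to measure rather than assume.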

Check also:

Anderson, M.J., Sheffield, D., Keutzer, K. 2012. A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers, IPDPS 2012