Help with GPU Cholesky factorization

jmlero · January 10, 2013, 1:07pm

Hi,

I have a computing problem: I need to compute several Cholesky factorizations at the same time. That is, I want to compute hundreds of different Cholesky factorizations of small matrices on the GPU simultaneously.

As far as I know, there are options to compute Cholesky on the GPU (maybe CUBLAS), but these options are designed to factor one large matrix using the whole GPU.
That approach is not the most appropriate for me.

Any suggestions?

Thanks in advance.

Momonga · January 10, 2013, 3:34pm

I’m not sure whether CUBLAS has any built-in factorization algorithms, but V. Volkov has published some work on this. I’m only a beginner, so I can’t be of much help; the only thing I can say right now is that you might want to check Volkov’s papers. Perhaps you’ve seen them already, though.

njuffa · January 10, 2013, 7:41pm

If I understand correctly, you are looking for the batched Cholesky factorization of many small matrices. I am not aware of any ready-to-use code for this at this time. While CUBLAS offers some batched operations, Cholesky factorization is not among them. On the registered developer website you can find code that performs batched solves and matrix inversions on small matrices, speeding these up by performing all operations in shared memory (which in turn limits the size of the matrices that can be handled). You could use that as a template for implementing a batched Cholesky factorization; the code is under a BSD license.
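As a rough illustration of the shared-memory size limit mentioned above, the following C++ helper estimates the largest square matrix that fits in a given shared-memory budget. The helper name `max_dim` is hypothetical and is not taken from the batched solver code, and the premise that one thread block holds exactly one matrix and nothing else is an assumption made for the sake of the arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Largest n such that one n x n matrix with elements of elem_bytes bytes
// fits into smem_bytes bytes. Assumption: the matrix is the only thing
// stored (no right-hand sides, no padding against bank conflicts).
inline std::size_t max_dim(std::size_t smem_bytes, std::size_t elem_bytes) {
    assert(elem_bytes > 0);
    std::size_t n = 0;
    while ((n + 1) * (n + 1) * elem_bytes <= smem_bytes) {
        ++n;
    }
    return n;
}
```

Assuming a 48 KB per-block budget (a common limit on GPUs of this era, but check your device), `max_dim(48 * 1024, sizeof(float))` gives 110 and `max_dim(48 * 1024, sizeof(double))` gives 78. Real code also needs room for right-hand sides or padding, so practical limits would be lower.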

jmlero · January 11, 2013, 12:13pm

Hi, thanks for the responses.

@Momonga, I have read Volkov’s papers and research. It is very good work, but it is designed to use the entire GPU to compute a single Cholesky factorization (using CUBLAS).

@Njuffa, thanks for your response. Do you have any link or information about batched operations? I cannot find anything.
I’m not sure, but I think it is not an option for me.

I need to compute several (500) different Cholesky factorizations (of small, distinct matrices) at the same time, as fast as possible.

Thanks a lot!

njuffa · January 11, 2013, 7:27pm

Batched operations are designed for the efficient handling of multiple small matrices. A single small matrix only uses a fraction of the available computational capability of the GPU. By working on multiple matrices at once, one can utilize all of the computational capability. Batched operations work best when there are thousands of small matrices, but they still perform much better than dealing with matrices individually when there are 500 matrices. As far as I am aware, existing batched codes require all matrices in the batch to have the same size (this is not a restriction in typical real-world scenarios).
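To make the idea of a batch concrete, here is a minimal CPU reference sketch in C++ of a batched Cholesky factorization. It is not the CUBLAS API or the batched solver code; the function names and the storage layout (matrices of the same size `n`, stored back to back in row-major order) are assumptions chosen for illustration. On a GPU, each iteration of the outer batch loop would be handed to its own thread or thread block.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Factor one n x n symmetric positive definite matrix in place (row-major).
// Only the lower triangle is read. On success it holds L with A = L * L^T;
// the strict upper triangle is left untouched.
// Returns false if a non-positive pivot is encountered.
bool cholesky_in_place(double* a, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        double d = a[j * n + j];
        for (std::size_t k = 0; k < j; ++k) d -= a[j * n + k] * a[j * n + k];
        if (d <= 0.0) return false;
        d = std::sqrt(d);
        a[j * n + j] = d;
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i * n + j];
            for (std::size_t k = 0; k < j; ++k) s -= a[i * n + k] * a[j * n + k];
            a[i * n + j] = s / d;
        }
    }
    return true;
}

// Factor `count` matrices of identical size n stored contiguously.
// Returns the number of matrices whose factorization failed.
std::size_t batched_cholesky(std::vector<double>& batch,
                             std::size_t n, std::size_t count) {
    assert(batch.size() == n * n * count);
    std::size_t failures = 0;
    // The iterations are independent, which is what makes batching
    // a good fit for a GPU: one thread (or thread block) per matrix.
    for (std::size_t b = 0; b < count; ++b) {
        if (!cholesky_in_place(batch.data() + b * n * n, n)) ++failures;
    }
    return failures;
}
```

For example, a batch holding the two 2x2 matrices `{4,2,2,3}` and `{9,3,3,5}` is factored with `batched_cholesky(batch, 2, 2)`. The same-size requirement shows up here as the single `n` shared by every matrix in the batch.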

For the batched operations supported by CUBLAS, please consult the CUBLAS documentation (it comes with your CUDA installation). For the source code of the batched dense solver and matrix inversion, please log into the registered developer website and download it from there. To log in (or register as a developer), please go to

https://developer.nvidia.com/joining-cuda-registered-developer-program

Look for “CUDA Batch Solver” among the available downloads.

jmlero · January 15, 2013, 11:41am

Thanks for the responses and for the links.

I see I’m not the only person with this problem. I’m going to develop a “batched Cholesky”.

vvolkov · January 17, 2013, 1:25am

I was looking into small Cholesky factorizations, such as 32x32, but I decided to ignore the symmetry and go for full Gaussian elimination instead: it is hard to skip updating the upper half of a small matrix when computing in SIMD. You can check some sample code at:

http://www.eecs.berkeley.edu/~volkov/volkov11-unrolling.pdf

Look at page 23 onward. It may not be optimal; possibly some other data layout would work better, but I think better code would look conceptually similar.
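The trade-off described above can be sketched on the CPU as follows. This is not the code from the linked slides; it is a minimal C++ illustration, assuming Gaussian elimination without pivoting (which is safe for symmetric positive definite matrices) and a row-major layout. The function name is hypothetical. Note that the trailing update touches the full square block at every step, so every row does the same amount of work and no triangular masking is needed.

```cpp
#include <cassert>
#include <cstddef>

// In-place LU factorization without pivoting of an n x n row-major matrix.
// On success the strict lower triangle holds the multipliers of a
// unit-lower-triangular L, and the upper triangle (with diagonal) holds U.
// Returns false if a zero pivot is encountered.
bool lu_nopivot_in_place(double* a, std::size_t n) {
    assert(a != nullptr);
    for (std::size_t k = 0; k < n; ++k) {
        const double pivot = a[k * n + k];
        if (pivot == 0.0) return false;
        for (std::size_t i = k + 1; i < n; ++i) {
            const double m = a[i * n + k] / pivot;
            a[i * n + k] = m;
            // Full-width update: the symmetry of the input is ignored,
            // so the upper half is updated as well.
            for (std::size_t j = k + 1; j < n; ++j) {
                a[i * n + j] -= m * a[k * n + j];
            }
        }
    }
    return true;
}
```

If the Cholesky factor itself is needed and the input is symmetric positive definite, it can be recovered afterwards by scaling column j of L by the square root of the j-th diagonal entry of U. The cost of ignoring symmetry is roughly twice the arithmetic, traded for uniform work across SIMD lanes.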

Check also:

Anderson, M.J., Sheffield, D., Keutzer, K. 2012. A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers. IPDPS 2012.

Vasily
