LU, QR and Cholesky factorizations using GPU

Sarnath · June 26, 2009, 12:33pm

Vasily,

Congrats on the Good work!

I have a rather naive question.

I see you have used single precision for the calculations.

Is it “ok” to have solvers written in single precision?
Is double precision absolutely necessary?
Appreciate any help.

We ourselves are looking to GPUize a L-D-L(T) factorization. The questions are from that context. FYI.

Thanks,

Best Regards,
Sarnath

vvolkov · June 26, 2009, 3:17pm

In fact, I’m not sure. I started working on sparse solvers quite recently - should have a better idea in a few months ;-)

It depends on the accuracy you need. In some applications single precision is sufficient.

Also, if condition number is small enough, you can get full double precision accuracy in Ax=b using single precision factorization and iterative refinement.

Vasily

jam11 · June 26, 2009, 3:23pm

Hi,
Is there a way to use the ATLAS package instead of mkl/VSL combo ?

For the random number, the standard can be used.

vvolkov · June 26, 2009, 3:47pm

Sure, this should be straightforward. There are only a few LAPACK/BLAS routines that I call in MKL.

Vasily

Sarnath · June 28, 2009, 2:40pm

Thank you, Vasily!

kerrmudgeon · August 3, 2009, 4:08am

People use Givens rotations to compute QR for the HPEC Challenge benchmark because the benchmark specification requires it. Specifically, it requires the “fast Givens” computation that reduces the number of square roots by an order of magnitude.

The “fast” method incurs many dependencies, and reciprocal square roots are easily parallelizable on GPUs. You might be able to come up with an efficient strategy for rotating blocks using Givens rotations, but the well-known method of parallelizing it requires global synchronization.

Good luck!

zhenyu · August 3, 2009, 7:39am

Thank you for your reply.

I have tried different methods of Givens QR factorization, they all failed to get good performance on GPUs.

I came up with some formal analysis to explain why they fail.

I will post my analysis very soon.

avidday · September 26, 2009, 11:55am

Now that downloads are working again, I finally was able to get a copy of Vasily’s code and try it out on some new hardware I just got hold of, and the results are, well, interesting to say the least.

I built the code with ACML 4.3.0 and gcc 4.3, using a 3.0Ghz Phenom II 945 on a 790FX board with 8Gb of DDR3-1333 CL9 memory and a pair of GTX275s, running Ubuntu 8.09 with CUDA 2.3 and 190.18 drivers.

The performance of the hybrid solvers is impressive - they all peak over 300 single precision GFLOP/s per second. But look at a couple of the results (suspiciously powers of two). At first I assumed I had some really dodgy PCI-e bandwidth issues or something. It wasn’t until I ran the cpu only versions of the benchmarks that it became obvious that it was the CPU side library that was causing the problems. And look at what a misbehaving host library can do to the performance! It has been several years since I tried ACML for anything serious, and I must say it has improved a lot, but the variability in performance is just awful.

The moral of the tale - know thy host library. And thanks to Vasily for doing the hard thinking and making such great code available to the rest of us.

MMB · September 26, 2009, 12:44pm

Now that downloads are working again, I finally was able to get a copy of Vasily’s code and try it out on some new hardware I just got hold of, and the results are, well, interesting to say the least.

I built the code with ACML 4.3.0 and gcc 4.3, using a 3.0Ghz Phenom II 945 on a 790FX board with 8Gb of DDR3-1333 CL9 memory and a pair of GTX275s, running Ubuntu 8.09 with CUDA 2.3 and 190.18 drivers.

The performance of the hybrid solvers is impressive - they all peak over 300 single precision GFLOP/s per second. But look at a couple of the results (suspiciously powers of two). At first I assumed I had some really dodgy PCI-e bandwidth issues or something. It wasn’t until I ran the cpu only versions of the benchmarks that it became obvious that it was the CPU side library that was causing the problems. And look at what a misbehaving host library can do to the performance! It has been several years since I tried ACML for anything serious, and I must say it has improved a lot, but the variability in performance is just awful.

The moral of the tale - know thy host library. And thanks to Vasily for doing the hard thinking and making such great code available to the rest of us.

Hi Avidday. Your results are interesting. To what do you attribute the sharp dip in performance, in both graphs, at around 8,000?

Malcolm

avidday · September 26, 2009, 2:35pm

There are actually several dips, although the one at N=8192 is by far the worst. It is definitely something in the ACML BLAS or LAPACK implementation (probably SGEMM but I haven’t done anything to confirm it yet).

JulienECE · March 1, 2010, 11:14am

Good morning,

      I working on a computational finance project and I need an [b]LU decomposition[/b] ([b]factorization[/b]) implementation on CUDA.

I see Vasily Volkov implementation but I try to use it without mkl library and it doesn’t work. Some functions don’t appear in LAPACK library and volkov library.

I found a cuda library that implement LU Decomposition named Cula (http://www.culatools.com/functions) but in basic version, the LU function (culaDeviceSGETRF => Computes an LU factorization of a general matrix, using partial pivoting with row interchanges) doesn’t work for me (Data error).

I extremely need LU decomposition to solve a linear system !

Thanks for all your answers.

Julien

vvolkov · March 1, 2010, 12:05pm

Can you be more specific in saying what doesn’t work?

I used some Intel-specific random number generator for testing purpose, but otherwise you should be able to plug in any other LAPACK implementation like ACML, possibly adopting to its calling conventions. This should amount to changing ~2 lines in gpu_sgetrf.cpp.

I tried to make the code as-easy-to-read-as-possible rather than as-easy-to-compile-as-possible. Sorry about that :-P

Vasily

JulienECE · March 1, 2010, 7:33pm

I use LAPACK and I substituate your randomly matrix filling function by random function from stdlib.

And when I compile, I have theses errors:

[codebox]gpu_lapack/gpu_sgetrf.cpp:76: error: â€˜sgetrfâ€™ was not declared in this scope

gpu_lapack/gpu_sgetrf.cpp:103: error: â€˜strsmâ€™ was not declared in this scope[/codebox]

Thank you for your answers

avidday · March 1, 2010, 8:39pm

You need to add prototype declarations for sgetrf and strsm which match the function names in your library. Perhaps something like:

extern "C" void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info);

extern "C" void strsm_(char *side, char *uplo, char *transa, char *diag, int *m, int *n, float *alpha, float *a, int *lda, float *b, int *ldb);

might work if you are using the netlib lapack and g++/gfortran.

JulienECE · March 1, 2010, 10:11pm

[quote name=‘avidday’ post=‘1010329’ date=‘Mar 1 2010, 09:39 PM’]

You need to add prototype declarations for sgetrf and strsm which match the function names in your library. Perhaps something like:

[codebox]

gpu_sgetrf.o: In function `gpu_sgetrf(int, float*, int, int*, int&, bool)':

gpu_sgetrf.cpp:(.text+0x50e): undefined reference to `sgetrf’

gpu_sgetrf.cpp:(.text+0x6c0): undefined reference to `strsm’

collect2: ld returned 1 exit status[/codebox]

To precise my changes, I add following my LIB: into my Makefile : -L/usr/lib -L. -llapack -lblas -lf2c

avidday · March 1, 2010, 10:25pm

That is a linking error, not a compilation error. I think libguide is part of MKL (or Intel’s threading primatives). If you aren’t using MKL, you shouldn’t be linking with it, you should be linking with the necessary libraries for the BLAS and LAPACK implementation you are using.

JulienECE · March 1, 2010, 10:27pm

Moreover, Despite that I added prototypes, I have these two errors:

[codebox]gpu_sgetrf.o: In function `gpu_sgetrf(int, float*, int, int*, int&, bool)':

gpu_sgetrf.cpp:(.text+0x50e): undefined reference to `sgetrf’

gpu_sgetrf.cpp:(.text+0x6c0): undefined reference to `strsm’

collect2: ld returned 1 exit status[/codebox]

To precise my changes, I add following my LIB: into my Makefile : -L/usr/lib -L. -llapack -lblas -lf2c

avidday · March 1, 2010, 10:35pm

As I pointed out in my first reply, you have to modify the function calls to sgetrf and strsm to match whatever is in the LAPACK you are using. I gave you a possible prototype use already.

hpc-matt · March 12, 2010, 2:15am

Vasily (and other reading this forum),

My name is Matt and I am using your code for QR decomposition. How does one go about extracting the Q matrix from your results? Normally, a call to SORGQR with the results from SGEQRF would be sufficient, however you do not store the v vectors in the input matrix below the diagonal. Any details concerning this would be much appreciated.

Thanks!

Matthew

Edit- This only occurs for non square matrices. Does your algorithm handle “tall” matrices?

vvolkov · March 13, 2010, 2:51am

You are right, this implementation is limited to factorization of square matrices.

Vasily