LU, QR and Cholesky factorizations using GPU

Vasily,

Congrats on the Good work!

I have a rather naive question.

I see you have used single precision for the calculations.

Is it “ok” to have solvers written in single precision?
Is double precision absolutely necessary?
Appreciate any help.

We ourselves are looking to GPUize a L-D-L(T) factorization. The questions are from that context. FYI.

Thanks,

Best Regards,
Sarnath

In fact, I’m not sure. I started working on sparse solvers quite recently - should have a better idea in a few months ;-)

It depends on the accuracy you need. In some applications single precision is sufficient.

Also, if condition number is small enough, you can get full double precision accuracy in Ax=b using single precision factorization and iterative refinement.

Vasily

Hi,
Is there a way to use the ATLAS package instead of mkl/VSL combo ?

For the random number, the standard can be used.

Sure, this should be straightforward. There are only a few LAPACK/BLAS routines that I call in MKL.

Vasily

Thank you, Vasily!

People use Givens rotations to compute QR for the HPEC Challenge benchmark because the benchmark specification requires it. Specifically, it requires the “fast Givens” computation that reduces the number of square roots by an order of magnitude.

The “fast” method incurs many dependencies, and reciprocal square roots are easily parallelizable on GPUs. You might be able to come up with an efficient strategy for rotating blocks using Givens rotations, but the well-known method of parallelizing it requires global synchronization.

Good luck!

Thank you for your reply.

I have tried different methods of Givens QR factorization, they all failed to get good performance on GPUs.

I came up with some formal analysis to explain why they fail.

I will post my analysis very soon.

Now that downloads are working again, I finally was able to get a copy of Vasily’s code and try it out on some new hardware I just got hold of, and the results are, well, interesting to say the least.

I built the code with ACML 4.3.0 and gcc 4.3, using a 3.0Ghz Phenom II 945 on a 790FX board with 8Gb of DDR3-1333 CL9 memory and a pair of GTX275s, running Ubuntu 8.09 with CUDA 2.3 and 190.18 drivers.

The performance of the hybrid solvers is impressive - they all peak over 300 single precision GFLOP/s per second. But look at a couple of the results (suspiciously powers of two). At first I assumed I had some really dodgy PCI-e bandwidth issues or something. It wasn’t until I ran the cpu only versions of the benchmarks that it became obvious that it was the CPU side library that was causing the problems. And look at what a misbehaving host library can do to the performance! It has been several years since I tried ACML for anything serious, and I must say it has improved a lot, but the variability in performance is just awful.

The moral of the tale - know thy host library. And thanks to Vasily for doing the hard thinking and making such great code available to the rest of us.

Hi Avidday. Your results are interesting. To what do you attribute the sharp dip in performance, in both graphs, at around 8,000?

Malcolm

There are actually several dips, although the one at N=8192 is by far the worst. It is definitely something in the ACML BLAS or LAPACK implementation (probably SGEMM but I haven’t done anything to confirm it yet).

Good morning,

      I working on a computational finance project and I need an [b]LU decomposition[/b] ([b]factorization[/b]) implementation on CUDA.

I see Vasily Volkov implementation but I try to use it without mkl library and it doesn’t work. Some functions don’t appear in LAPACK library and volkov library.

I found a cuda library that implement LU Decomposition named Cula (http://www.culatools.com/functions) but in basic version, the LU function (culaDeviceSGETRF => Computes an LU factorization of a general matrix, using partial pivoting with row interchanges) doesn’t work for me (Data error).

I extremely need LU decomposition to solve a linear system !

Thanks for all your answers.

Julien

Can you be more specific in saying what doesn’t work?

I used some Intel-specific random number generator for testing purpose, but otherwise you should be able to plug in any other LAPACK implementation like ACML, possibly adopting to its calling conventions. This should amount to changing ~2 lines in gpu_sgetrf.cpp.

I tried to make the code as-easy-to-read-as-possible rather than as-easy-to-compile-as-possible. Sorry about that :-P

Vasily

I use LAPACK and I substituate your randomly matrix filling function by random function from stdlib.

And when I compile, I have theses errors:

[codebox]gpu_lapack/gpu_sgetrf.cpp:76: error: ‘sgetrf’ was not declared in this scope

gpu_lapack/gpu_sgetrf.cpp:103: error: ‘strsm’ was not declared in this scope[/codebox]

Thank you for your answers

You need to add prototype declarations for sgetrf and strsm which match the function names in your library. Perhaps something like:

extern "C" void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info);

extern "C" void strsm_(char *side, char *uplo, char *transa, char *diag, int *m, int *n, float *alpha, float *a, int *lda, float *b, int *ldb);

might work if you are using the netlib lapack and g++/gfortran.

[quote name=‘avidday’ post=‘1010329’ date=‘Mar 1 2010, 09:39 PM’]

You need to add prototype declarations for sgetrf and strsm which match the function names in your library. Perhaps something like:

[codebox]

gpu_sgetrf.o: In function `gpu_sgetrf(int, float*, int, int*, int&, bool)’:

gpu_sgetrf.cpp:(.text+0x50e): undefined reference to `sgetrf’

gpu_sgetrf.cpp:(.text+0x6c0): undefined reference to `strsm’

collect2: ld returned 1 exit status[/codebox]

To precise my changes, I add following my LIB: into my Makefile : -L/usr/lib -L. -llapack -lblas -lf2c

That is a linking error, not a compilation error. I think libguide is part of MKL (or Intel’s threading primatives). If you aren’t using MKL, you shouldn’t be linking with it, you should be linking with the necessary libraries for the BLAS and LAPACK implementation you are using.

Moreover, Despite that I added prototypes, I have these two errors:

[codebox]gpu_sgetrf.o: In function `gpu_sgetrf(int, float*, int, int*, int&, bool)’:

gpu_sgetrf.cpp:(.text+0x50e): undefined reference to `sgetrf’

gpu_sgetrf.cpp:(.text+0x6c0): undefined reference to `strsm’

collect2: ld returned 1 exit status[/codebox]

To precise my changes, I add following my LIB: into my Makefile : -L/usr/lib -L. -llapack -lblas -lf2c

As I pointed out in my first reply, you have to modify the function calls to sgetrf and strsm to match whatever is in the LAPACK you are using. I gave you a possible prototype use already.

Vasily (and other reading this forum),

My name is Matt and I am using your code for QR decomposition. How does one go about extracting the Q matrix from your results? Normally, a call to SORGQR with the results from SGEQRF would be sufficient, however you do not store the v vectors in the input matrix below the diagonal. Any details concerning this would be much appreciated.

Thanks!

Matthew

Edit- This only occurs for non square matrices. Does your algorithm handle “tall” matrices?

You are right, this implementation is limited to factorization of square matrices.

Vasily