In fact, I’m not sure. I started working on sparse solvers quite recently - should have a better idea in a few months ;-)
It depends on the accuracy you need. In some applications single precision is sufficient.
Also, if condition number is small enough, you can get full double precision accuracy in Ax=b using single precision factorization and iterative refinement.
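The refinement idea above can be sketched in plain C. This is a hypothetical toy illustration (not Vasily's code, and no BLAS/LAPACK calls): factor A once in single precision, then recover close to full double precision for Ax=b by computing residuals in double and reusing the cheap single-precision factors for each correction. All names (`mixed_solve`, etc.) are mine.

```c
#include <math.h>
#include <stdlib.h>

/* In-place single-precision LU with partial pivoting; piv records row swaps. */
static void slu_factor(int n, float *A, int *piv) {
    for (int k = 0; k < n; k++) {
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabsf(A[i*n + k]) > fabsf(A[p*n + k])) p = i;
        piv[k] = p;
        if (p != k)
            for (int j = 0; j < n; j++) {
                float t = A[k*n + j]; A[k*n + j] = A[p*n + j]; A[p*n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {
            A[i*n + k] /= A[k*n + k];
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }
}

/* Forward/back substitution with the single-precision factors. */
static void slu_solve(int n, const float *LU, const int *piv, float *x) {
    for (int k = 0; k < n; k++) {
        if (piv[k] != k) { float t = x[k]; x[k] = x[piv[k]]; x[piv[k]] = t; }
        for (int i = k + 1; i < n; i++) x[i] -= LU[i*n + k] * x[k];
    }
    for (int i = n - 1; i >= 0; i--) {
        for (int j = i + 1; j < n; j++) x[i] -= LU[i*n + j] * x[j];
        x[i] /= LU[i*n + i];
    }
}

/* Solve Ax=b: single-precision factorization + double-precision refinement. */
void mixed_solve(int n, const double *A, const double *b, double *x, int sweeps) {
    float *LU  = malloc(sizeof(float) * n * n);
    float *d   = malloc(sizeof(float) * n);
    int   *piv = malloc(sizeof(int) * n);
    for (int i = 0; i < n * n; i++) LU[i] = (float)A[i];
    slu_factor(n, LU, piv);

    for (int i = 0; i < n; i++) x[i] = 0.0;          /* start from zero */
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < n; i++) {                /* r = b - A*x in double */
            double r = b[i];
            for (int j = 0; j < n; j++) r -= A[i*n + j] * x[j];
            d[i] = (float)r;
        }
        slu_solve(n, LU, piv, d);                    /* correction in single */
        for (int i = 0; i < n; i++) x[i] += (double)d[i];
    }
    free(LU); free(d); free(piv);
}
```

Each sweep costs only an O(n^2) residual plus two O(n^2) triangular solves, so the expensive O(n^3) factorization stays in fast single precision - but this only converges when the condition number is small enough, as noted above.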
People use Givens rotations to compute QR for the HPEC Challenge benchmark because the benchmark specification requires it. Specifically, it requires the “fast Givens” computation that reduces the number of square roots by an order of magnitude.
The “fast” method incurs many dependencies, and reciprocal square roots are easily parallelizable on GPUs. You might be able to come up with an efficient strategy for rotating blocks using Givens rotations, but the well-known method of parallelizing it requires global synchronization.
Now that downloads are working again, I finally was able to get a copy of Vasily’s code and try it out on some new hardware I just got hold of, and the results are, well, interesting to say the least.
I built the code with ACML 4.3.0 and gcc 4.3, using a 3.0GHz Phenom II 945 on a 790FX board with 8GB of DDR3-1333 CL9 memory and a pair of GTX275s, running Ubuntu 8.09 with CUDA 2.3 and 190.18 drivers.
The performance of the hybrid solvers is impressive - they all peak at over 300 GFLOP/s in single precision. But look at a couple of the results (suspiciously at powers of two). At first I assumed I had some really dodgy PCI-e bandwidth issues or something. It wasn’t until I ran the CPU-only versions of the benchmarks that it became obvious that the host-side library was causing the problems. And look at what a misbehaving host library can do to the performance! It has been several years since I tried ACML for anything serious, and I must say it has improved a lot, but the variability in performance is just awful.
The moral of the tale - know thy host library. And thanks to Vasily for doing the hard thinking and making such great code available to the rest of us.
There are actually several dips, although the one at N=8192 is by far the worst. It is definitely something in the ACML BLAS or LAPACK implementation (probably SGEMM but I haven’t done anything to confirm it yet).
I’m working on a computational finance project and I need an [b]LU decomposition[/b] ([b]factorization[/b]) implementation in CUDA.
I’ve seen Vasily Volkov’s implementation, but when I try to use it without the MKL library it doesn’t work - some functions appear in neither the LAPACK library nor Volkov’s library.
I found a CUDA library named CULA that implements LU decomposition (http://www.culatools.com/functions), but in the basic version the LU function (culaDeviceSGETRF => computes an LU factorization of a general matrix, using partial pivoting with row interchanges) doesn’t work for me (Data error).
I badly need an LU decomposition to solve a linear system!
Can you be more specific in saying what doesn’t work?
I used an Intel-specific random number generator for testing purposes, but otherwise you should be able to plug in any other LAPACK implementation such as ACML, possibly adapting to its calling conventions. This should amount to changing ~2 lines in gpu_sgetrf.cpp.
I tried to make the code as-easy-to-read-as-possible rather than as-easy-to-compile-as-possible. Sorry about that :-P
You need to add prototype declarations for sgetrf and strsm which match the function names in your library. Perhaps something like:
[code]
extern "C" void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info);
extern "C" void strsm_(char *side, char *uplo, char *transa, char *diag, int *m, int *n, float *alpha, float *a, int *lda, float *b, int *ldb);
[/code]
might work if you are using the netlib lapack and g++/gfortran.
That is a linking error, not a compilation error. I think libguide is part of MKL (or Intel’s threading primitives). If you aren’t using MKL, you shouldn’t be linking with it; you should be linking with the necessary libraries for the BLAS and LAPACK implementation you are using.
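As a rough example (the library names here are an assumption for a netlib reference LAPACK/BLAS built with gfortran - substitute whatever your ACML or other implementation provides):

```shell
# Hypothetical link line: drop -lguide / MKL libraries and pull in your own
# LAPACK/BLAS plus the Fortran runtime instead.
g++ -o gpu_sgetrf gpu_sgetrf.o -L/usr/local/cuda/lib -lcublas -lcudart \
    -llapack -lblas -lgfortran
```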
As I pointed out in my first reply, you have to modify the function calls to sgetrf and strsm to match whatever is in the LAPACK you are using. I already gave you possible prototypes.
My name is Matt and I am using your code for QR decomposition. How does one go about extracting the Q matrix from your results? Normally, a call to SORGQR with the results from SGEQRF would be sufficient, however you do not store the v vectors in the input matrix below the diagonal. Any details concerning this would be much appreciated.
Thanks!
Matthew
Edit: this only occurs for non-square matrices. Does your algorithm handle “tall” matrices?