LAPACK + CUBLAS

I have 1000 relatively small matrices (size 9x8) on which to perform
a singular value decomposition.

I guess the procedure for bringing the computation to the GPU would be something like:

-> Load all 1000 matrices to the GPU
-> Run a LAPACK routine in a kernel (or as a kernel)
-> Bring the results back to the CPU
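Something like the following host-side sketch would match that flow. To be clear, the svd9x8_batch kernel here is a placeholder of my own invention, not an existing CUDA or CUBLAS routine:

```cuda
// Hypothetical driver for a batched 9x8 SVD. svd9x8_batch is a
// placeholder kernel, not an existing CUDA/CUBLAS routine.
#include <cuda_runtime.h>
#include <stdlib.h>

#define M 9
#define N 8
#define BATCH 1000

// placeholder: a real implementation would compute one SVD per matrix
__global__ void svd9x8_batch(const float* A, float* U, float* S,
                             float* V, int batch) { }

int main(void)
{
    size_t bytesA = sizeof(float) * M * N * BATCH;  // all inputs packed
    size_t bytesS = sizeof(float) * N * BATCH;      // 8 singular values each
    size_t bytesV = sizeof(float) * N * N * BATCH;

    float *hA = (float*)malloc(bytesA);
    float *hS = (float*)malloc(bytesS);
    /* ... fill hA with the 1000 matrices, column-major, back to back ... */

    float *dA, *dU, *dS, *dV;
    cudaMalloc((void**)&dA, bytesA);
    cudaMalloc((void**)&dU, bytesA);
    cudaMalloc((void**)&dS, bytesS);
    cudaMalloc((void**)&dV, bytesV);

    // 1) load all 1000 matrices in one transfer, not 1000 small ones
    cudaMemcpy(dA, hA, bytesA, cudaMemcpyHostToDevice);

    // 2) one launch covers the whole batch
    svd9x8_batch<<<(BATCH + 63) / 64, 64>>>(dA, dU, dS, dV, BATCH);

    // 3) bring the results back to the CPU
    cudaMemcpy(hS, dS, bytesS, cudaMemcpyDeviceToHost);

    free(hA); free(hS);
    cudaFree(dA); cudaFree(dU); cudaFree(dS); cudaFree(dV);
    return 0;
}
```

The point of packing all 1000 matrices into one buffer is that a single large memcpy amortizes the transfer overhead far better than 1000 tiny ones would.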

1.) In what way could CUBLAS + standard LAPACK make this computation faster?
2.) Are there plans to provide a CUDA-LAPACK library?
3.) I guess that a standard LAPACK library running on top of CUBLAS wouldn’t solve my problem much faster!?

Regards R4DIUM

Just to reply with what I know - but I don’t know much…

CUBLAS is a set of subroutines specialized to CUDA - I don’t believe it can be used as a drop-in replacement for BLAS (and its coverage of the BLAS is not yet complete anyway). So I don’t believe you could take a standard LAPACK and have its routines call CUBLAS.

There are also no LAPACK routines (i.e., no SVD routines) available that use CUDA; I suspect that is in the works, however. LAPACK is such an obvious thing that I can’t imagine NVIDIA would ignore it.

Also, your small SVDs may not even be suitable for CUDA - generally you need a somewhat large problem before CUDA starts to be worthwhile. The main hang-up is host<->device communication. If your small SVDs were somehow operating on the same data, that might be a different matter: your calculations would then stay mostly on the GPU, with minimal data-transfer overhead.
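If transfers are the bottleneck, one thing that might help (assuming the hardware supports it) is pinned host memory plus asynchronous streams, so copies for one chunk of matrices can overlap with compute on another. A rough sketch, with made-up chunk sizes and a placeholder kernel:

```cuda
// Sketch of hiding transfer cost with pinned memory and streams;
// the chunking, sizes, and the process kernel are illustrative only.
#include <cuda_runtime.h>

#define NCHUNKS 4
#define CHUNK   (250 * 9 * 8)   // 250 of the 9x8 matrices per chunk

__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder compute
}

int main(void)
{
    float *h, *d;
    cudaMallocHost((void**)&h, sizeof(float) * CHUNK * NCHUNKS); // pinned
    cudaMalloc((void**)&d, sizeof(float) * CHUNK * NCHUNKS);

    cudaStream_t s[NCHUNKS];
    for (int i = 0; i < NCHUNKS; ++i) cudaStreamCreate(&s[i]);

    // each chunk's upload, kernel, and download are queued in its own
    // stream, so copies for one chunk can overlap compute on another
    for (int i = 0; i < NCHUNKS; ++i) {
        float *hp = h + i * CHUNK, *dp = d + i * CHUNK;
        cudaMemcpyAsync(dp, hp, sizeof(float) * CHUNK,
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(dp, CHUNK);
        cudaMemcpyAsync(hp, dp, sizeof(float) * CHUNK,
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NCHUNKS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```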

I’ve bugged these lists a few times about matrix inversion, or the solution of A*X=B, and have been met with nothing but silence… so even that is not yet implemented. But I suspect (hope?) it is in the works; I know people are working on it, if not at NVIDIA.

Does the speedup decrease once the problem passes a certain size?

Let’s say I have a 50000x50000 matrix I want to find eigenvalues/vectors of. Even in single precision, that’s going to be about 10 GB, well in excess of GPU memory. LAPACK is still working on the problem after 3 days; would a CUDA implementation be any faster in that case?

I speak as a 2-week old “expert”…

If you just tried to naively run such a problem, it would likely just crash - so far it seems to me that CUDA is not that elegant in how it handles a calculation that runs out of memory. (And your desktop display would likely misbehave too, if you were using the same card for video.) I know of no GPU that has 10 GB of RAM!

If you could code up the problem to feed manageable amounts of data to the GPU through to a successful conclusion, I suspect CUDA would perform quite well compared to doing the calculation on the desktop (particularly if the desktop had to resort to swap space!). But I can’t offer any clues as to how to code up an eigenvalue problem this way.
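To illustrate only the general pattern (not an actual eigensolver), here is a sketch that marches a matrix too large for the device through a fixed-size device buffer, one block of rows at a time; all names and sizes are invented:

```cuda
// Out-of-core sketch: stream row-blocks of a matrix too big for the
// GPU through one reusable device buffer. scaleBlock is a made-up
// placeholder (a real eigensolver would need far more than one pass
// over independent blocks).
#include <cuda_runtime.h>
#include <stdlib.h>

#define NROWS     4096   // stand-in for the "huge" dimension
#define NCOLS     4096
#define BLOCKROWS 512    // how many rows fit on the device at once

__global__ void scaleBlock(float* blk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) blk[i] *= 0.5f;   // placeholder per-block work
}

int main(void)
{
    size_t blkElems = (size_t)BLOCKROWS * NCOLS;
    float *hMat = (float*)malloc(sizeof(float) * NROWS * NCOLS);
    float *dBlk;
    cudaMalloc((void**)&dBlk, sizeof(float) * blkElems); // fixed budget

    // march the big matrix through the GPU one block of rows at a time
    for (int r = 0; r < NROWS; r += BLOCKROWS) {
        float* hBlk = hMat + (size_t)r * NCOLS;
        cudaMemcpy(dBlk, hBlk, sizeof(float) * blkElems,
                   cudaMemcpyHostToDevice);
        scaleBlock<<<((int)blkElems + 255) / 256, 256>>>(dBlk, (int)blkElems);
        cudaMemcpy(hBlk, dBlk, sizeof(float) * blkElems,
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(dBlk); free(hMat);
    return 0;
}
```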

Thanks, that was my understanding as well.

DO NOT USE LAPACK’s ALGORITHM
I have solved SVD on CUDA. DO NOT use the bidiagonal-QR algorithm; use Jacobi!
It is possible to solve the 9x8 SVDs in thread blocks, and I already did that last year.
It’s easy to get a good speedup with this method.
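For reference, here is a minimal sketch of the one-sided (Hestenes) Jacobi approach for a batch of 9x8 matrices. This is not the poster’s code; it takes the even simpler route of one thread per matrix rather than one block per matrix, and the kernel name, fixed sweep count, and tolerances are my own assumptions:

```cuda
// One-sided (Hestenes) Jacobi SVD, one thread per 9x8 matrix.
// Matrices are column-major and packed back to back. Sketch only:
// sweep count and tolerances are arbitrary; singular values come
// out unsorted.
#define M 9
#define N 8

__global__ void jacobiSvdBatch(const float* A, float* U, float* S,
                               float* V, int batch)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= batch) return;

    float a[M * N];   // working copy; converges to U*diag(S)
    float v[N * N];   // accumulates the right singular vectors

    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < M; ++i)
            a[j * M + i] = A[id * M * N + j * M + i];
        for (int i = 0; i < N; ++i)
            v[j * N + i] = (i == j) ? 1.0f : 0.0f;
    }

    for (int sweep = 0; sweep < 30; ++sweep) {
        float off = 0.0f;
        for (int p = 0; p < N - 1; ++p)
            for (int q = p + 1; q < N; ++q) {
                // inner products of columns p and q
                float alpha = 0.0f, beta = 0.0f, gamma = 0.0f;
                for (int i = 0; i < M; ++i) {
                    alpha += a[p * M + i] * a[p * M + i];
                    beta  += a[q * M + i] * a[q * M + i];
                    gamma += a[p * M + i] * a[q * M + i];
                }
                off += gamma * gamma;
                if (fabsf(gamma) < 1e-12f) continue;

                // rotation that makes columns p and q orthogonal
                float zeta = (beta - alpha) / (2.0f * gamma);
                float t = copysignf(1.0f, zeta) /
                          (fabsf(zeta) + sqrtf(1.0f + zeta * zeta));
                float c = rsqrtf(1.0f + t * t);
                float s = c * t;
                for (int i = 0; i < M; ++i) {
                    float ap = a[p * M + i], aq = a[q * M + i];
                    a[p * M + i] = c * ap - s * aq;
                    a[q * M + i] = s * ap + c * aq;
                }
                for (int i = 0; i < N; ++i) {
                    float vp = v[p * N + i], vq = v[q * N + i];
                    v[p * N + i] = c * vp - s * vq;
                    v[q * N + i] = s * vp + c * vq;
                }
            }
        if (off < 1e-20f) break;   // all column pairs orthogonal
    }

    // singular values are the column norms; normalizing gives U
    for (int j = 0; j < N; ++j) {
        float norm = 0.0f;
        for (int i = 0; i < M; ++i)
            norm += a[j * M + i] * a[j * M + i];
        norm = sqrtf(norm);
        S[id * N + j] = norm;
        for (int i = 0; i < M; ++i)
            U[id * M * N + j * M + i] =
                (norm > 0.0f) ? a[j * M + i] / norm : 0.0f;
        for (int i = 0; i < N; ++i)
            V[id * N * N + j * N + i] = v[j * N + i];
    }
}
```

Because each rotation only touches that thread’s own matrix, there is no inter-thread communication at all - which is exactly why a batch of tiny SVDs maps so cleanly onto the GPU.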

For SVD, there is a solution.

For eigenvalues, it is possible by applying a divide-and-conquer method.