I’ve seen some posts here about people using LAPACK with cublas, how easy is it to do this? I imagine you can’t just link LAPACK against cublas because the function names are all different, could you just make some stubs for BLAS that call cublas and link LAPACK against that?
If that would work is it worth doing, or does having LAPACK not on the gpu counteract the speed up from using cublas? I expect this would mean you end up doing a lot of transfers between the cpu and the gpu every time lapack calls blas which might slow things down too much?
I presume nvidia are working on doing this anyway, so I don’t want to spend too much time on it, unless they definitely aren’t doing it.
It isn’t all that complex to write C wrappers (or use Flagon from Fortran) to replace selected host BLAS calls with CUBLAS calls in a custom BLAS which can be linked with LAPACK. I have also done the same with SuperLU with some success. I found putting some very simple size heuristics inside the wrapper to select whether to use host or device BLAS works reasonably too. There is a paper by Massimo Fatica from NVIDIA which describes a “hybrid” strategy using host and GPU BLAS simultaneously to accelerate LINPACK which is probably worth reading as well.
It’s not a very good strategy to run LAPACK on top of CUBLAS, because while the idea behind LAPACK is that it can fall back to using the lower-level BLAS functions if necessary for a certain matrix, having to transfer data back and forth between the host and GPU is less than ideal (which is what will happen if you use it with CUBLAS).
Depending on what algorithms you need to run on the GPU, someone may have already released a custom version; for example, vvolkov released his matrix factorization code (search this forum and you’ll find it), and it has near-ideal performance. For some other specialty transforms, I’m afraid the best way is just to write your own GPU version of it, so that you can maximize the amount of time that the algorithm spends working with the data on the GPU.
i thought LAPACK was supposed to rely pretty heavily on BLAS, although i expect it depends pretty heavily on what you are doing with it. for now i just need a matrix eigenvalue solver but unfortunately the iterative methods (i think that is what the one in the sdk is) don’t work well for the kind of matrices i am interested in, whereas the lapack ones do.
having looked at that linpack presentation i think i will just try it out and see how it compares to doing it all on the CPU. using a heuristic to decide whether to send things to cublas or a cpu version of blas is a good idea.
That is correct. LAPACK does work on top of BLAS. The problem is that it is designed to run on a CPU, so the various checks it does for iterative methods don’t work well when it’s used with CUBLAS. To avoid repeated overhead in each iteration (and slowing down your computing!), it’s better to write a method specifically for the GPU.
I’m surprised no one mentionned MAGMA so far. It’s meant to be GPU-enabled LAPACK library. Seriously, it’s pretty naive to believe that you will ever get performance for free by linking LAPACK with CUBLAS transparently, it would be nice, but this is way ahead of the state-of-the-art research :)