Lanczos' Algorithm Shot in the dark

Does anyone by chance know a parallel version of lanczos’ algorithm? Looking at it, I can’t see how it can be parallel, but I might be missing something. If you don’t know what this algorithm is, go here . That is a pretty good definition of it. The problem with it is, each element in the loop is dependent on something previously calculated. I don’t really even see a way to precalculate anything for this. I did just notice that I am able to speed up the matrix multiplications inside of the loop. This should give me at least a little performance. Somebody let me know if I’m missing something here.

Two answers: first, you may not need to parallelize the algorithm but instead run the steps sequentially, and those steps internally use parallelization to speed up their computes. IE, you just call the CUDA BLAS routines which are already parallelized for you.

Second answer, yes, there is a parallel Lanczos algorithm, but I get the feeling that it may be an implementation challenge. Take a look at

Yeah, I mentioned that in my original post. The more I think about it, the more that this seems to be a useful way to use the gpu, as most of the calculation will be Dot Products and can be done using CUBLAS. The Funny thing is, I was reading that paper at the same time that I pulled up your reply :-P It seems like it might be best just to go ahead and implement the original algorithm and test that for performance increases compared to a CPU implemented version. It should be enough to be beneficial to my project at least. Thank you for the quick response :-)


So, with my matrix notation comprehension being very poor, what does the line Wj<----- AVj - BjV(j-1) represent? A is a matrix and V is some vector (which is maybe a vector in A at position j?) I would post this somewhere else, but not sure where else to do it.