Many small eigen/singular value decompositions?


I’m something of a newbie in parallel programming and the CUDA environment.

My problem involves a lot of small eigenvalue and singular value decompositions (EIG/SVD), so I think it could be accelerated with parallel processing in CUDA. However, it is very hard to find implementations of those basic numerical algorithms that run IN A THREAD.

Is it because it is not efficient, or just because I cannot find one?

Any idea is appreciated.

Thanks in advance.

I don’t completely understand your question. When you have a lot of EIG/SVD to do, you can have each thread do one of them. Then you can just re-use the normal code. I don’t know if there are parallel versions of these algorithms.
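To make the "one decomposition per thread" idea concrete, here is a minimal sketch, assuming symmetric matrices and a cyclic Jacobi eigenvalue iteration (chosen only because it needs no library calls, not because anyone in this thread used it). The names `jacobi_eig`, `eig_batch`, the `HD` macro, the size `N = 3` and the tolerance are all hypothetical. The host loop stands in for a kernel launch: in a real kernel, `t` would come from `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <math.h>

// Compiles as plain C++; under nvcc the routine also becomes callable from a kernel.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

constexpr int N = 3;  // hypothetical matrix size (the thread later discusses 10)

// Cyclic Jacobi eigenvalue iteration on ONE symmetric N x N matrix
// (row-major, overwritten). On return the diagonal holds the eigenvalues, unsorted.
HD inline void jacobi_eig(double* a, int max_sweeps = 30) {
    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        double off = 0.0;  // sum of squares above the diagonal
        for (int p = 0; p < N; ++p)
            for (int q = p + 1; q < N; ++q) off += a[p * N + q] * a[p * N + q];
        if (off < 1e-24) return;  // assumed absolute tolerance

        for (int p = 0; p < N - 1; ++p) {
            for (int q = p + 1; q < N; ++q) {
                double apq = a[p * N + q];
                if (apq == 0.0) continue;  // nothing to annihilate
                // Rotation angle that zeroes a[p][q].
                double theta = (a[q * N + q] - a[p * N + p]) / (2.0 * apq);
                double t = (theta >= 0.0 ? 1.0 : -1.0) /
                           (fabs(theta) + sqrt(theta * theta + 1.0));
                double c = 1.0 / sqrt(t * t + 1.0);
                double s = t * c;
                // A <- A * J (columns p and q).
                for (int k = 0; k < N; ++k) {
                    double akp = a[k * N + p], akq = a[k * N + q];
                    a[k * N + p] = c * akp - s * akq;
                    a[k * N + q] = s * akp + c * akq;
                }
                // A <- J^T * A (rows p and q).
                for (int k = 0; k < N; ++k) {
                    double apk = a[p * N + k], aqk = a[q * N + k];
                    a[p * N + k] = c * apk - s * aqk;
                    a[q * N + k] = s * apk + c * aqk;
                }
            }
        }
    }
}

// Host stand-in for a kernel: iteration t plays the role of thread t,
// and each "thread" owns the matrix stored at mats + t * N * N.
inline void eig_batch(double* mats, int count) {
    for (int t = 0; t < count; ++t) jacobi_eig(mats + t * N * N);
}
```

Calling `eig_batch(mats, count)` on a packed array of `count` matrices leaves each matrix's eigenvalues on its diagonal. The point of the sketch is only that the per-matrix routine touches nothing but its own data, so it can be reused unchanged by every thread.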

A normal version of EIG/SVD runs in a single thread.

Thank you for your prompt reaction.

I wonder if there are any available implementations of EIG/SVD that run inside each thread. In the serial version of my implementation, I used the LAPACK library for that.

Do I have to reimplement EIG/SVD functions for threads in CUDA? I don’t think I can use a public linear algebra library such as LAPACK in threads.

I have implemented a 512x512 SVD in CUDA, with all threads working on one matrix.
An SVD could be implemented across blocks using the block Jacobi algorithm.
One SVD per thread using bidiagonalization plus QR is possible, but I do not recommend that.
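For readers who want to see what a small, self-contained SVD routine for a single thread might look like, here is a minimal sketch. It is not the poster's code: it swaps in the scalar one-sided Jacobi (Hestenes) method instead of bidiagonalization plus QR, because it fits in a few lines, and it returns singular values only. The function name `jacobi_singular_values`, the `HD` macro and the relative tolerance `1e-15` are assumptions made for illustration.

```cpp
#include <math.h>

// Compiles as plain C++; under nvcc the routine also becomes callable from a kernel.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// One-sided Jacobi: rotate pairs of columns until all columns are mutually
// orthogonal; the column norms are then the singular values.
// a:     ROWS x COLS matrix, row-major, overwritten.
// sigma: output, COLS singular values, unsorted.
template <int ROWS, int COLS>
HD void jacobi_singular_values(double* a, double* sigma, int max_sweeps = 30) {
    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        bool rotated = false;
        for (int p = 0; p < COLS - 1; ++p) {
            for (int q = p + 1; q < COLS; ++q) {
                // alpha = |col p|^2, beta = |col q|^2, gamma = col p . col q
                double alpha = 0.0, beta = 0.0, gamma = 0.0;
                for (int k = 0; k < ROWS; ++k) {
                    double x = a[k * COLS + p], y = a[k * COLS + q];
                    alpha += x * x;
                    beta += y * y;
                    gamma += x * y;
                }
                // Columns already orthogonal to working precision (assumed tolerance).
                if (fabs(gamma) <= 1e-15 * sqrt(alpha * beta)) continue;
                rotated = true;
                // Rotation that makes the two columns orthogonal.
                double zeta = (beta - alpha) / (2.0 * gamma);
                double t = (zeta >= 0.0 ? 1.0 : -1.0) /
                           (fabs(zeta) + sqrt(1.0 + zeta * zeta));
                double c = 1.0 / sqrt(1.0 + t * t);
                double s = c * t;
                for (int k = 0; k < ROWS; ++k) {
                    double x = a[k * COLS + p], y = a[k * COLS + q];
                    a[k * COLS + p] = c * x - s * y;
                    a[k * COLS + q] = s * x + c * y;
                }
            }
        }
        if (!rotated) break;  // a full sweep without rotations: converged
    }
    for (int j = 0; j < COLS; ++j) {
        double norm2 = 0.0;
        for (int k = 0; k < ROWS; ++k) norm2 += a[k * COLS + j] * a[k * COLS + j];
        sigma[j] = sqrt(norm2);
    }
}
```

For the 2 x 2 matrix with rows (3, 0) and (4, 5), `jacobi_singular_values<2, 2>` yields sqrt(5) and sqrt(45) in some order. The matrix size is a template parameter so that the loops have compile-time bounds, which is one common way to keep a per-thread routine in registers; whether that is enough for 10 x 10 is exactly the concern raised in this thread.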

Hmmm. In my case, I have thousands of SVD problems with small (10 x 10) matrices.

Is there any reason not to recommend running a small svd routine in a thread?

I already tried that, and it works, although it is not efficient because the routine uses too many registers.

In my algorithm, 8 x 512 SVDs are part of the 512 x 512 SVD in step 1. I am very sure it is possible to do a 10 x 256 SVD with 5 blocks, which is easy to convert to 25 10 x 10 SVDs, and to stay within the shared memory and register constraints. I have already done 8 x 8 SVDs before, with 256 threads in each block, 4 blocks, and 7 loops for each SVD. I am very sure it is possible. Good luck!

10x10 is not a good size for the GPU.
I used global memory for data exchange between blocks. For a smaller size like 10x10, it is better to read the matrix in the form of 16 x 10 and use 80 threads (but 80 is not divisible by 32), so in a 256-thread block you can do 3 10 x 10 SVDs at the most. That would use about 2k of shared memory and fewer than 16 registers.
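The arithmetic behind those numbers can be checked with a few compile-time constants. This is only a sketch: the constant names are made up, and the shared memory figure assumes one padded 16 x 10 copy of each matrix stored as single-precision `float`, which the post does not state explicitly but which lands close to the quoted "about 2k".

```cpp
#include <cstddef>

constexpr int kWarpSize = 32;       // threads per warp
constexpr int kBlockThreads = 256;  // threads per block, as in the post
constexpr int kThreadsPerSvd = 80;  // threads assigned to one 10 x 10 SVD
constexpr int kPaddedRows = 16;     // matrix read "in the form of 16 x 10"
constexpr int kCols = 10;

// How many 80-thread groups fit in one 256-thread block.
constexpr int kSvdsPerBlock = kBlockThreads / kThreadsPerSvd;

// Threads in the block left without work.
constexpr int kIdleThreads = kBlockThreads - kSvdsPerBlock * kThreadsPerSvd;

// 80 threads are 2 full warps plus a partial one of this many threads.
constexpr int kPartialWarpThreads = kThreadsPerSvd % kWarpSize;

// Shared memory for the padded matrices, ASSUMING float storage.
constexpr std::size_t kSharedBytes =
    static_cast<std::size_t>(kSvdsPerBlock) * kPaddedRows * kCols * sizeof(float);

static_assert(kSvdsPerBlock == 3, "3 SVDs per 256-thread block");
static_assert(kIdleThreads == 16, "16 threads of the block stay idle");
static_assert(kPartialWarpThreads == 16, "80 is not a multiple of the warp size");
static_assert(kSharedBytes == 1920, "about 2k of shared memory");
```

So 240 of the 256 threads do useful work, each SVD straddles a warp boundary, and the three padded matrices occupy 1920 bytes under the float assumption.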