Batched SVD?

Hi,

I need to compute the (thin) SVD of one million 25x8 matrices. Essentially I need each thread to perform a SVD. I’ve read that these kind of operations can be done using “batched” versions of the routines implemented by cuda libraries (http://www.culatools.com/blog/2011/12/09/batched-operations/).

Is there any CUDA library which implements a batched SVD? Or is there any open source library which does this?

Thanks for your help
Delio

I’m currently faced with almost the same problem but with size n*20 where n is from 32 to 4096. Some googling and search on this forum shows that generally this is a hard problem. Though this thread seems been dead for a month now, I will still post what I have in mind right now.

Before I discuss the ways that I thought of, note that in my case I can reduce my problem to finding the eigenvalue/vectors of X = A*A^H by using batched multiplication to reduce the problem size to 20 * 20(of course I don’t have to, but this leaves us with more room for different models/algorithms).

First, cublas now has “batched operation” for qr decomposition, matrix inversion and matrix multiplication. So maybe we can write a batch svd routine on top of these batched routines by qr iteration or orthogonal iteration. Problem are: 1. I’m not sure how performant the batched mode of cublas are as I have not tested them; 2. using a raw qr iteration leaves us a slower algorithm than lapack/MKL’s svd(I think lapack uses implicit qr which should be faster, plus they are more optmizied); 3. It’s hard to test for individual convergence since we have to batch all the qr/multiplications at one call. But if the individual problems are similar(which is the case for me), I don’t mind to run a few more iteration until all of the problems converge.

Second, another way is to implement a svd entirely with cuda c code in one matrix per thread approach. This is, from coding perspective, doable since for example we have http://www.oocities.org/hjsmithh/download.html#Algo358 as guidance for implicit qr iteration. But the problem then would be coalesce memory access(ref. http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/). I don’t think I can fit all the matrices into shared memory in this case for we have each thread a matrix, and even if in a 88 case that will give us totalNumThread88 = 64totalNumThreads >> 48K of shared memory on chip, please correct me if I’m wrong since I’m not so sure about the numbers.

Third way is to let a block of threads working on a one matrix, so that’s a matrix per thread block approach. We can use paralleled jacobi method for finding the svd, I think there are some resource out there for jacobi method. I haven’t looked deeper into this and I hope someone can give some input on this method(specifically paralleled jacobi svd with cuda, since I do want take optimizing memory access into account).

Fourth way is as this post http://stackoverflow.com/questions/17401765/parallel-implementation-for-multiple-svds-using-cuda suggested, use a stream to batch the operations. I think this is the easiest one to try out but as some forum threads had suggested https://devtalk.nvidia.com/default/topic/821174/?comment=4498655, it only shows disappointing results.

So to summurize, I think for my problem size the best order of hope would be third option > first > second > fourth, considering together the level of difficulty of implementation and the gain it would bring. And for @delio your problem size, may be reduce the problem to A*A^H and code a matrix per thread(second way) would work? I’m not sure what do you think?

Hi all,

facing a similar problem now, I would have a a lot of complex valued 16x32 SVDs to process in parallel.

Is there any library that can already map this problem to CUDA efficiently?

UPDATE: I found this thread stating that the cuSolver library has a batched SVD mode - though the OP was complaining about its lack of performance for very small matrix sizes.

https://devtalk.nvidia.com/default/topic/1042245/cuda-programming-and-performance/batched-svd-on-cuda-surprisingly-slow/

I also just found a relatively new library KAUST BLAS (originating from Saudi Arabia) that claims to have batched SVD in CUDA