Batched SVD?


I need to compute the (thin) SVD of one million 25x8 matrices. Essentially I need each thread to perform a SVD. I’ve read that these kind of operations can be done using “batched” versions of the routines implemented by cuda libraries (

Is there any CUDA library which implements a batched SVD? Or is there any open source library which does this?

Thanks for your help

I’m currently faced with almost the same problem but with size n*20 where n is from 32 to 4096. Some googling and search on this forum shows that generally this is a hard problem. Though this thread seems been dead for a month now, I will still post what I have in mind right now.

Before I discuss the ways that I thought of, note that in my case I can reduce my problem to finding the eigenvalue/vectors of X = A*A^H by using batched multiplication to reduce the problem size to 20 * 20(of course I don’t have to, but this leaves us with more room for different models/algorithms).

First, cublas now has “batched operation” for qr decomposition, matrix inversion and matrix multiplication. So maybe we can write a batch svd routine on top of these batched routines by qr iteration or orthogonal iteration. Problem are: 1. I’m not sure how performant the batched mode of cublas are as I have not tested them; 2. using a raw qr iteration leaves us a slower algorithm than lapack/MKL’s svd(I think lapack uses implicit qr which should be faster, plus they are more optmizied); 3. It’s hard to test for individual convergence since we have to batch all the qr/multiplications at one call. But if the individual problems are similar(which is the case for me), I don’t mind to run a few more iteration until all of the problems converge.

Second, another way is to implement a svd entirely with cuda c code in one matrix per thread approach. This is, from coding perspective, doable since for example we have as guidance for implicit qr iteration. But the problem then would be coalesce memory access(ref. I don’t think I can fit all the matrices into shared memory in this case for we have each thread a matrix, and even if in a 88 case that will give us totalNumThread88 = 64totalNumThreads >> 48K of shared memory on chip, please correct me if I’m wrong since I’m not so sure about the numbers.

Third way is to let a block of threads working on a one matrix, so that’s a matrix per thread block approach. We can use paralleled jacobi method for finding the svd, I think there are some resource out there for jacobi method. I haven’t looked deeper into this and I hope someone can give some input on this method(specifically paralleled jacobi svd with cuda, since I do want take optimizing memory access into account).

Fourth way is as this post suggested, use a stream to batch the operations. I think this is the easiest one to try out but as some forum threads had suggested, it only shows disappointing results.

So to summurize, I think for my problem size the best order of hope would be third option > first > second > fourth, considering together the level of difficulty of implementation and the gain it would bring. And for @delio your problem size, may be reduce the problem to A*A^H and code a matrix per thread(second way) would work? I’m not sure what do you think?

Hi all,

facing a similar problem now, I would have a a lot of complex valued 16x32 SVDs to process in parallel.

Is there any library that can already map this problem to CUDA efficiently?

UPDATE: I found this thread stating that the cuSolver library has a batched SVD mode - though the OP was complaining about its lack of performance for very small matrix sizes.

I also just found a relatively new library KAUST BLAS (originating from Saudi Arabia) that claims to have batched SVD in CUDA