I’m currently facing almost the same problem, but with size n*20 where n ranges from 32 to 4096. Some googling and searching on this forum show that this is generally a hard problem. Though this thread seems to have been dead for a month now, I will still post what I have in mind right now.

Before I discuss the approaches I thought of, note that in my case I can reduce my problem to finding the eigenvalues/eigenvectors of X = A*A^H, using batched multiplication to shrink the problem size to 20*20 (of course I don’t have to, but this leaves us with more room for different models/algorithms).
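To make the reduction concrete, here is a small host-side NumPy sketch (not CUDA), assuming A is stored 20 x n so that A*A^H is 20*20 — the eigenpairs of the Gram matrix give the squared singular values and the left singular vectors of A:

```python
import numpy as np

# Reduction sketch: if A is 20 x n, the Gram matrix X = A A^H is only
# 20 x 20, and its eigenvalues are the squared singular values of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4096))

X = A @ A.conj().T                 # 20 x 20 Hermitian PSD matrix
w, U = np.linalg.eigh(X)           # eigenvalues in ascending order
sigma = np.sqrt(w[::-1])           # singular values, descending

# Compare against a direct SVD of A
s_ref = np.linalg.svd(A, compute_uv=False)
print(np.allclose(sigma, s_ref))   # True (up to floating-point error)
```

On the GPU, the `A @ A.conj().T` step is exactly the batched multiplication mentioned above; the remaining work is a batch of small 20*20 Hermitian eigenproblems.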

First, cuBLAS now has “batched operation” for QR decomposition, matrix inversion, and matrix multiplication. So maybe we can write a batched SVD routine on top of these batched routines via QR iteration or orthogonal iteration. The problems are:

1. I’m not sure how performant the batched mode of cuBLAS is, as I have not tested it.
2. A raw QR iteration leaves us with a slower algorithm than LAPACK/MKL’s SVD (I think LAPACK uses implicit QR, which should be faster; plus those libraries are more optimized).
3. It’s hard to test for individual convergence, since we have to batch all the QR decompositions/multiplications in one call. But if the individual problems are similar (which is the case for me), I don’t mind running a few more iterations until all of the problems converge.
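As a sketch of what such a routine would compute, here is orthogonal iteration on a single Hermitian matrix written in plain NumPy, using only the two primitives that cuBLAS exposes in batched form (gemm and QR). This is just the math, not an optimized or batched implementation; a real version would run the loop over many 20*20 matrices at once:

```python
import numpy as np

def orthogonal_iteration(X, iters=500):
    """Eigenpairs of a Hermitian PSD matrix using only QR and matrix
    multiplication. Sketch only: a batched CUDA version would replace
    the two calls in the loop with cublas<t>gemmBatched and a batched QR."""
    Q = np.eye(X.shape[0])
    for _ in range(iters):
        Z = X @ Q                  # would be a batched gemm on the GPU
        Q, _ = np.linalg.qr(Z)     # would be a batched QR on the GPU
    # Rayleigh quotients on the diagonal converge to the eigenvalues
    return np.diag(Q.T @ X @ Q), Q

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 256))
X = A @ A.T                        # 20 x 20 Gram matrix, as in the reduction
evals, Q = orthogonal_iteration(X)
print(np.max(evals), np.max(np.linalg.eigvalsh(X)))  # should agree closely
```

Note the convergence rate depends on the gaps between adjacent eigenvalues, which is exactly why a fixed shared iteration count across the whole batch (problem 3 above) costs extra iterations for the slowest-converging matrices.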

Second, another way is to implement an SVD entirely in CUDA C code, with a one-matrix-per-thread approach. From a coding perspective this is doable, since for example we have http://www.oocities.org/hjsmithh/download.html#Algo358 as guidance for implicit QR iteration. But the problem then would be coalesced memory access (ref. http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/). I don’t think I can fit all the matrices into shared memory in this case, since each thread gets its own matrix: even in an 8*8 case with 4-byte floats, that is 8*8*4 = 256 bytes per thread, so 256*totalNumThreads quickly exceeds the 48K of shared memory on chip. Please correct me if I’m wrong, since I’m not so sure about the numbers.
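A quick back-of-envelope check of that shared-memory claim, assuming 4-byte float elements and 48 KB of shared memory per SM (the figure for many Kepler/Maxwell-era GPUs; newer chips differ):

```python
# Shared-memory budget for the one-matrix-per-thread idea.
# Assumes 4-byte floats and 48 KB of shared memory per SM.
matrix_dim = 8
bytes_per_thread = matrix_dim * matrix_dim * 4   # 256 bytes per 8x8 matrix
shared_mem_bytes = 48 * 1024

max_threads = shared_mem_bytes // bytes_per_thread
print(max_threads)  # 192 -- far fewer threads than an SM can run concurrently
```

So only 192 threads’ worth of 8*8 matrices fit per SM, which would cripple occupancy; the matrices would have to stay in global memory (or registers), making the coalescing concern the central one.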

Third way is to let a block of threads work on one matrix, i.e. a matrix-per-thread-block approach. We can use a parallel Jacobi method for finding the SVD; I think there are some resources out there on the Jacobi method. I haven’t looked deeper into this, and I hope someone can give some input on this method (specifically parallel Jacobi SVD with CUDA, since I do want to take memory access optimization into account).
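For reference, here is the math of one-sided Jacobi SVD as plain NumPy pseudocode, not a CUDA implementation. The reason it maps well to a matrix-per-thread-block design is that each (p, q) column-pair rotation touches only those two columns, so non-overlapping pairs can be rotated by different threads in parallel within a block:

```python
import numpy as np

def one_sided_jacobi_svd(A, sweeps=30, tol=1e-12):
    """One-sided Jacobi SVD sketch: rotate column pairs of A until all
    columns are mutually orthogonal; column norms are then the singular
    values. NumPy sketch of the math only, not a CUDA kernel."""
    A = A.astype(float).copy()
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                off = max(off, abs(gamma))
                if abs(gamma) < tol:
                    continue
                # Givens rotation zeroing the (p, q) column inner product
                zeta = (beta - alpha) / (2.0 * gamma)
                t = (1.0 if zeta >= 0 else -1.0) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                J = np.array([[c, s], [-s, c]])
                A[:, [p, q]] = A[:, [p, q]] @ J
                V[:, [p, q]] = V[:, [p, q]] @ J
        if off < tol:
            break
    sigma = np.linalg.norm(A, axis=0)   # singular values (unsorted)
    return sigma, V

rng = np.random.default_rng(2)
A = rng.standard_normal((32, 8))
sigma, V = one_sided_jacobi_svd(A)
print(np.allclose(np.sort(sigma), np.sort(np.linalg.svd(A, compute_uv=False))))
```

In a CUDA version, the two inner loops would be replaced by a parallel ordering of the column pairs (e.g. round-robin), with the block’s columns staged in shared memory, which is exactly where the memory access optimization mentioned above comes in.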

Fourth way is, as this post http://stackoverflow.com/questions/17401765/parallel-implementation-for-multiple-svds-using-cuda suggested, to use streams to batch the operations. I think this is the easiest one to try out, but as some forum threads have suggested (https://devtalk.nvidia.com/default/topic/821174/?comment=4498655), it only shows disappointing results.

So to summarize, I think for my problem size the most promising order would be: third option > first > second > fourth, weighing the implementation difficulty of each against the gain it would bring. And for your problem size, @delio, maybe reducing the problem to A*A^H and coding one matrix per thread (the second way) would work? I’m not sure; what do you think?