Calculate many SVDs of small matrices

Hi, i have a problem where i have to calculate the SVD of typically thousands of small matrices A independently. Each matrix A has 8 rows x 9 columns, maybe i extend it by one row to get 9x9. In my problem, I need the singular vector corresponding to the smallest singular value of A.

Are there any libraries available to do such kind of stuff (many threads, each does the SVD of a small matrix) ? If not, what approach could be good ? It should be relatively easy to implement, even if it’s not the fastest out there - maybe something similar as in…3-full-svd.html ? Power method may be useul, i’m not sure.

Alternatively, is there a way to re-formulate the problem mathematically so that the ‘many SVDS of small matrices’ can be re-cast into 'one SVD of a big matrix ? Advantage would be i could solve it with CULA or so. Either way, i think it wouldn’t make sense as the computational complexity would rise, i suppose.

Thx for any help, Hannes

This feature is on the road map for CULA. I don’t have an immediate time line, though. :(


From what I have read on this forum so far, this kind of a setup is rather atypical. Usually people are interested in operations on large matrices. One issue you might be facing is the cost of the data transport from the host to the GPU. Indeed, when the matrix is small and fixed in size, the cost of the SVD is linear with the volume of data you’re processing, and the cost of the data transfer is obviously linear with the data volume as well, so the gain (if any) in the performance of your SVD over the one run on the CPU might not necessarily justify the cost of the data transfer. In the standard setup (large matrices) the cost of the SVD is super-linear (data volume to the power of (3/2), if I’m not mistaken), and that power justifies the expense of data transfer.

To assess the performance gain resulting from running a GPU-based SVD in your setup, I would replace the GPU-based SVD with some operation of similar computational cost and which is easy to implement, e.g. self-multiplication of the matrix (maybe repeated a few times) followed by evaluation of the average element of such a product to return the bogus singular value. Only if that kind of a code yields a satisfactory speed up over the CPU-based code would I consider implementing the SVD.

Out of curiosity, what was the real-life problem that lead to such a setup?


We use similar matrix operations on many-small-matrices in machine learning / machine vision & image processing in those fields - as an example of one field where they’re used.

I can also think of applications in light transport simulation / compressible fluid sim / optimizing 3 dimensional geometric/physics algorithms in N (N>3) dimensional space where this would also apply… just to name a few.

As usual if you want to use GPU acceleration use it to large problem not the small parts.
So you do not need to transfer much of data (between host and device) and you get more acceleration (if solution code is GPU optimized).
I.e. try to move as much computation parts of the problem to GPU as you can and minimize data host-device transfer.

I’ve done similar work on QR-decompositions where I’m doing 1000’s of small QR factorizations in parallell.

And it definetly seems to be worthwhile even though the matrices aren’t very large. Even on my old GTS 250 I could see a 15x speedup over lapack running on a modern CPU.

@cudesnick, our application area is in this case signal processing.

Hi, I just saw that I have some replies to my post :-) Didn’t expect it after i didn’t get a reply for a couple of days…
Thx to all.

@kspagnoli: It’s good to hear that this feature is on the road map of CULA. I think it is not easy implementable, so good luck. I suppose it works only reasonably well (=significantly faster than on CPU) with the Fermi Architecture which has L1/L2 cache. Note that this kind of stuff has a couple of applications in machine vision. When you do a robust estimation of ‘something’, one way is to calculate a lot of hypotheses and take the one which is the best supported.

@ddmk: I think its OK to solve also ‘small problems’ (like the SVD of a 8x9 matrix) on the GPU, if you have a lot of them (thousands), they are indpendent of each other and do more or less the same thing (do not diverge ‘too much’).

@All: The problem is called ‘estimation of fundamental matrix’ (computer vision) - some sort of camera calibration. You have two camera images (stereo setup) and try to find the so-called ‘fundamental matrix’ which relates the screen-space-coordinates on the left image with the coordinates on the right images. For that, you first have to find ‘corresponding points’ in the two images. Then you do some kind of random sampling called RANSAC where you calculate a lot (thousands) of ‘hypotheses’ and take the hypothesis which is supported by the most samples (has most ‘inliers’). For each random sample, which is in fact a set of 8 point correspondences, you can calculate uniquely its Fundamental Matrix by doing the SVD of a 8x9 matrix and some ‘post-processing’ (enforce rank 2 of matrix) - the algorithmus is called ‘eight point algorithm’. A good reference in this field is Hartley & Zisserman, “Multiple view geometry”

@petterson: good work!

Hi @HannesF99 I recently run into the same problem as yours. I got thousands of small matrices (16 x 3), and I want to svd them in parallel. Is it possible to do svd for a single thread?
I know it’s been a few years, but can I ask did you work it out?
Thank you.

It’s a good news. Have they implemented it?

Hi I think for 16 *3 case, one can reduce it to 3 * 3 case and use a hard coded svd per thread to solve the problem! I just replied to another post of my thought on these problems Have you solved your problem already? If so how did you managed to solve it in the end?