Singular Value Decomposition (SVD)

I am trying to implement SVD in CUDA. I am new to CUDA programming and have done some research on parallel SVD computations, but I am not sure of the best approach for CUDA. Has anyone implemented SVD in CUDA, or does anyone know what kind of improvement I should expect from doing the SVD in CUDA?


I have implemented SVD in CUDA using my improved block Jacobi method. The same algorithm runs 5-8 times slower on the CPU than on the GPU. However, the GPU version shows no big speed improvement over Intel MKL SGESVD/SGESDD, and its accuracy is a little worse than that of the CPU libraries.

I’m still working on it, but I don’t think there will be a big acceleration on one GPU. The good news is that Jacobi-based algorithms are parallel, and my improvements make it possible to exploit parallelism across threads, blocks, and GPUs.
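
For anyone following along, here is a minimal sketch of why Jacobi-type methods parallelize well. It is plain one-sided (Hestenes) Jacobi in serial C++, not the improved block variant described above, and the name `jacobi_singular_values` and its parameters are made up for illustration. Each rotation touches only two columns, so rotations on disjoint column pairs are independent of each other; that is the property a GPU version can spread over threads and blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: singular values of an m x n matrix stored
// column-major in `a` (entry (i, j) is a[i + j * m]).
std::vector<double> jacobi_singular_values(std::vector<double> a,
                                           std::size_t m, std::size_t n,
                                           int max_sweeps = 60,
                                           double tol = 1e-14) {
    assert(a.size() == m * n);
    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        bool rotated = false;
        for (std::size_t p = 0; p + 1 < n; ++p) {
            for (std::size_t q = p + 1; q < n; ++q) {
                // Squared norms and inner product of columns p and q.
                double alpha = 0.0, beta = 0.0, gamma = 0.0;
                for (std::size_t i = 0; i < m; ++i) {
                    const double x = a[i + p * m], y = a[i + q * m];
                    alpha += x * x;
                    beta += y * y;
                    gamma += x * y;
                }
                // Skip pairs that are already numerically orthogonal.
                if (std::abs(gamma) <= tol * std::sqrt(alpha * beta)) continue;
                rotated = true;
                // Plane rotation that makes columns p and q orthogonal.
                const double zeta = (beta - alpha) / (2.0 * gamma);
                const double sign = (zeta >= 0.0) ? 1.0 : -1.0;
                const double t = sign / (std::abs(zeta) + std::sqrt(1.0 + zeta * zeta));
                const double c = 1.0 / std::sqrt(1.0 + t * t);
                const double s = c * t;
                for (std::size_t i = 0; i < m; ++i) {
                    const double x = a[i + p * m], y = a[i + q * m];
                    a[i + p * m] = c * x - s * y;
                    a[i + q * m] = s * x + c * y;
                }
            }
        }
        if (!rotated) break;  // a full sweep without rotations: converged
    }
    // After convergence the column norms are the singular values.
    std::vector<double> sv(n);
    for (std::size_t j = 0; j < n; ++j) {
        double sum = 0.0;
        for (std::size_t i = 0; i < m; ++i) sum += a[i + j * m] * a[i + j * m];
        sv[j] = std::sqrt(sum);
    }
    std::sort(sv.begin(), sv.end(), std::greater<double>());
    return sv;
}
```

Calling `jacobi_singular_values({1.0, 3.0, 2.0, 4.0}, 2, 2)` treats the input as the matrix with rows (1, 2) and (3, 4) and returns its two singular values in descending order. The sketch computes singular values only; accumulating the same rotations into an identity matrix would give the right singular vectors.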

Can you share your SVD code for CUDA? Thanks in advance.

Check out CULA’s SVD routine. For larger sizes it’s much faster than MKL.

“As of June 14, 2013, CULA will no longer be offered as an individually licensed product.”

There is also MAGMA, but there are no install packages for Windows, so I need to wrestle with a Fortran compiler… uuugh.

Is there anything else out there?

Hello guys,

I am also trying to use the CUDA SVD, but as soon as I increase the matrix size beyond 90x90, all entries of the matrix ‘S’ become 0, where the SVD of A is given by A = USV’.
I am using the function F.2. SVD with singular vectors (via Jacobi method).
Any help is appreciated. It is for our course project on Dynamic Mode Decomposition (for which we need the SVD).
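
One way to narrow down a symptom like this is to copy the factors back to the host and measure how far USV’ is from A. Below is a rough sketch of such a check in plain C++. The helper name `svd_residual` is hypothetical, and it assumes economy-size factors (U is m x k, V’ is k x n, with k = min(m, n)) stored column-major with no padding; adjust the indexing if your leading dimensions differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest absolute entry of A - U * diag(S) * VT.
// A is m x n, U is m x k, VT is k x n, S has k entries, k = min(m, n).
// All matrices are assumed column-major without padding.
double svd_residual(const std::vector<double>& A, const std::vector<double>& U,
                    const std::vector<double>& S, const std::vector<double>& VT,
                    std::size_t m, std::size_t n) {
    const std::size_t k = std::min(m, n);
    assert(A.size() == m * n && U.size() == m * k);
    assert(S.size() == k && VT.size() == k * n);
    double worst = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            // Entry (i, j) of the reconstructed matrix U * diag(S) * VT.
            double sum = 0.0;
            for (std::size_t l = 0; l < k; ++l) {
                sum += U[i + l * m] * S[l] * VT[l + j * k];
            }
            worst = std::max(worst, std::abs(A[i + j * m] - sum));
        }
    }
    return worst;
}
```

A residual near rounding error means the factors are fine and the problem is elsewhere (for example in how S is printed or copied). If S really is all zeros, the residual equals the largest entry of A in magnitude, which suggests the solver call itself did not complete, so its returned status and info value are worth checking.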


I was able to solve it using the cusolverDnDgesvd function, for double precision.


In CUDA 8, the cuSOLVER SVD is slow. MAGMA is faster:

Andrzej Chrzeszczyk

…I had the same experience. Comparing the cuSOLVER SVD to Armadillo, I found Armadillo to be much faster. But that was unsurprising, as my matrix size is really small. I am now switching to Intel MKL.