Simple SVD for CUDA

Hello all,

I am new to CUDA and I am doing a research project comparing GPU and CPU performance for 3D reconstruction. The main algorithm I have to focus on is Singular Value Decomposition. I have searched up and down for an SVD implemented with CUDA or CUBLAS but have yet to find anything. I attempted a step-by-step approach to writing my own but am stuck on how to compute the eigenvalues and eigenvectors.

I am hoping that someone has a simple CUDA SVD, nothing too fancy, that they wouldn’t mind letting me use for the main part of my project, which is the benchmarking. Or perhaps someone knows enough to help me get my own code where it needs to be.

Thanks in advance!!

There is a technical report, “LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs” that talks about GPU LU and QR decomposition, and GPU-specific issues in terms of performance. Hope it helps.

http://www.eecs.berkeley.edu/Pubs/TechRpts…ECS-2008-49.pdf

Yao

A 512 x 512 SVD in CUDA using the one-sided Jacobi method, with memory access optimized for the GPU. It is faster than Intel MKL SGESVD and slower than SGESDD.
A bidiagonalized input is suggested for better accuracy; the output is U and W*V.
ZhangShu, DouHeng, supplementary issue of <>,ChengDu, 2009.7
cusvd_by_ZhangShu.rar (338 KB)

In their timings they don’t include mem transfer.

I don’t understand WHY people don’t include this, as it is something you NEED to do if you want to use the GPU. Just keep that in mind when looking at the results.

Well just as an example, people might generate the input to the SVD on the GPU. I always prefer it when both performance numbers are shown, the ones including & excluding transfer.

As a side note: does anybody have a Gaussian elimination version for CUDA lying around? Right now I let MATLAB compute the inverse of a matrix I generate in CUDA and then use that inverse in CUDA again, but my algorithm does not actually need the explicit inverse.
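
To illustrate why the explicit inverse is unnecessary, here is a minimal CPU sketch in plain C of Gaussian elimination with partial pivoting that solves A x = b directly. It is not a CUDA implementation. The function name (solve_gauss), the row-major layout, double precision, and the pivot threshold are all assumptions made for illustration.

```c
#include <math.h>
#include <stddef.h>

/* Solve A x = b for a row-major n x n matrix a, overwriting a and b.
   Returns 0 with the solution in b, or -1 if a pivot is close to zero. */
int solve_gauss(double *a, double *b, size_t n)
{
    for (size_t k = 0; k < n; ++k) {
        /* Partial pivoting: largest entry in column k at or below row k. */
        size_t piv = k;
        for (size_t i = k + 1; i < n; ++i)
            if (fabs(a[i * n + k]) > fabs(a[piv * n + k]))
                piv = i;
        if (fabs(a[piv * n + k]) < 1e-14)
            return -1;
        if (piv != k) {
            for (size_t j = 0; j < n; ++j) {
                double tmp = a[k * n + j];
                a[k * n + j] = a[piv * n + j];
                a[piv * n + j] = tmp;
            }
            double tmp = b[k];
            b[k] = b[piv];
            b[piv] = tmp;
        }
        /* Eliminate column k from every row below the pivot row. */
        for (size_t i = k + 1; i < n; ++i) {
            double f = a[i * n + k] / a[k * n + k];
            for (size_t j = k; j < n; ++j)
                a[i * n + j] -= f * a[k * n + j];
            b[i] -= f * b[k];
        }
    }
    /* Back substitution on the upper-triangular system. */
    for (size_t k = n; k-- > 0;) {
        double sum = b[k];
        for (size_t j = k + 1; j < n; ++j)
            sum -= a[k * n + j] * b[j];
        b[k] = sum / a[k * n + k];
    }
    return 0;
}
```

For a fixed pivot k, the row updates in the elimination step are independent of one another, so that inner part is the natural candidate for one thread per row (or per element) on a GPU.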

At one point in the summer this would have helped me too. I wouldn’t mind taking a look at some code of Gaussian elimination, but don’t need it for anything.

I get sad when I realize that I enjoy looking at linear algebra and related code now :'(

I have just been reading about FLAME, and there are some working notes suggesting that they have working CUDA code for these kinds of operations. Tomorrow when at work I’ll dig deeper. I already had a CUDA algorithm in my head, but I am afraid that it is quite sub-optimal ;)

No, mem transfer is included in timings (if you are talking about my tech report).

Hi,

Did you have any luck in implementing SVD eventually… I am working on a similar project … Would be great if I could get some input…

Thanks in advance!!

Did you check CULAtools? They seem to have SVD: http://www.culatools.com/versions/basic

There was also a relevant paper “Singular value decomposition on GPU using CUDA” by Lahabar and Narayanan in IPDPS’09.

Yes, here is the paper. Really good read.

I’ve tested the CULAtools SVD and compared it to some of the times shown in the paper. CULAtools seems to give the same or slightly better results for larger sizes, even though I have a GTX260 while they use a GTX280 in their paper…
SVD.pdf (177 KB)

Oh yeah, I’ve also had a look at Mr. Volkov’s gpu_lapack, very cool stuff!

Thanks for the prompt response guys…

The thing is, I have written my own C code for QR factorization and I am nearly done with my C code for SVD using QR factorization. So I am trying to port this C code to CUDA, but I am running into too many issues, even after thoroughly going through the programming guide and several examples. So I was wondering if there is code out there that does not use the LAPACK library and could be of some assistance. I am running short of time guys, I would really appreciate this…

Many thanks…

Hmm… check Mr. Volkov’s first post here http://forums.nvidia.com/index.php?showtop…&pid=573376

There is a link to his gpu_lapack code there…

Jim, could you tell me the best way to write the following piece of code on the device?

for(i=k;i<=Row-1;i++)
{
    tempp=i*Row+j;
    Q[tempp]=Q[tempp]-2.0*tempt*R[i*Col+k];
}

I’ve tried putting idy in place of Col, but that doesn’t work. I would like the above code to run in parallel as tempp changes. Any help would be really appreciated…

Thanks in advance…
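
One possible way to map the loop above onto threads is sketched below in plain C, so it can be run and checked on the CPU. It rests on several assumptions: Q is Row x Row and R is Row x Col, both row-major; Q and R do not overlap; the element type is double; and the function names (update_one, update_column) are made up for illustration. Each loop iteration writes a different element of Q, so the iterations are independent. The thread index should stand in for the loop variable i, not for Col, because Col is the row stride of R and has to stay as it is.

```c
/* Work done by one thread. In a CUDA kernel the value of t would
   typically be blockIdx.x * blockDim.x + threadIdx.x for a 1D launch;
   here it is passed in so the sketch runs as ordinary C. */
void update_one(double *Q, const double *R, int Row, int Col,
                int j, int k, double tempt, int t)
{
    int i = k + t;   /* thread t handles row i = k + t */
    if (i >= Row)    /* guard: surplus threads do nothing */
        return;
    Q[i * Row + j] -= 2.0 * tempt * R[i * Col + k];
}

/* CPU stand-in for the kernel launch: calls the per-thread body once
   for each of num_threads thread indices. */
void update_column(double *Q, const double *R, int Row, int Col,
                   int j, int k, double tempt, int num_threads)
{
    for (int t = 0; t < num_threads; ++t)
        update_one(Q, R, Row, Col, j, k, tempt, t);
}
```

On the device, update_one would become the kernel body and the launch would need at least Row - k threads. The bounds check is what makes it safe to round the thread count up to a multiple of the block size.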