LU factorization code

Hi !

I implemented an LU factorization using CUBLAS functions (but not fully parallelized algorithm) doing :

for (int k=0; k<mDimR-1; k++)


   // pivot = max L, k+1..mDimR

  int pivotRow = cublasIsamax(mDimR-k, L.getGpuDataPointer()+k + k*mDimR, 1); // row relative to the current submatrix

  pivotRow = pivotRow+k-1;

 if (pivotRow!=k)


  	cublasSswap(mDimC, L.getGpuDataPointer()+pivotRow, mDimR, L.getGpuDataPointer()+k, mDimR);

  	cublasSswap(mDimC, permute.getGpuDataPointer()+pivotRow, 1, permute.getGpuDataPointer()+k, 1);


 float valcheck;

  cublasGetVector(1,sizeof(float), L.getGpuDataPointer()+k+ k*mDimR, 1, &valcheck, 1);

  if (fabs(valcheck) < 1E-20)


  	cout << "This matrix is not inversible or too ill conditionned" << endl;



 cublasSscal(mDimR-k-1, 1./valcheck, L.getGpuDataPointer()+k+1+ k*mDimR, 1);

  cublasSger (mDimR-k-1, mDimC-k-1, -1., L.getGpuDataPointer()+k+1+ k*mDimR, 1, L.getGpuDataPointer()+k+ (k+1)*mDimR, mDimR, L.getGpuDataPointer()+(k+1)*mDimR+k+1, mDimR);


where mDimR is the number of rows and mDimC of colums. L is my variable from my Matrix class that I want to factorize and getGpuDataPointer() returns the (float*) adress of the array in the device. permute is a mDimR*1 matrix which contains at the end the order of swapped rows, and which initially contains 1:mDimR integers.

Most of the time in this code is spent in cublasSger(). The result is fine.

My question is : executing this code for large matrices, it is really slow compared to matlab’s lu() function. Matlab is able to compute a 40964096 LU factorization in about 3 seconds (or 10 seconds on my other computer) whereas this function takes more than 15 seconds for a 30723072 matrix.

Is my initial algorithm so bad ? Is my GPGPU version not ok ? (I didn’t test this same code on cpu)

I might do a full parallel version of the LU factorization but I’m not familiar enough with parallel computing for the moment so it could take me some time. If a semi-parallel version (like this code) could work at least as fast as matlab’s code, it would be good…

Thank you very much in advance !



I was trying to implement a QR - Factorization with cublas routines, but there are many issues to consider when implementing such algorithms for the GPU.

At first you are using a lot level-1 blas routines in a loop: You need to have vector sizes more than 5 MB (you need definetely more than 3000 elements) to have an appropriate speed-up. Have a look on this thread :…t=&#entry239276

Second you are using Get or Set Methods (host -> device , device -> host) in a loop: I found out that the data transfer rate depends on the data size (linear behavior) if you transfer vectors less than 200 K elements --> no peak 2.5 GB/s as one might think. Have a look on this thread:…t=&#entry238571

I also advice you not to work with cublas routines when dealing with “small” matrix sizes and level - 1 or maybe level -2 blas routines. It might be a a good start, but you should switch to the runtime api. You might have a speed up when implementing dot products by yourself.

Hope this was usefull for you.

greetz, cem

Thank you for your answer!

Do you consider that it worth it coding it completely in parallel rather than using this same algorithm using my own functions in the runtime api ?
I don’t catch the difference between using a cublas routine rather than coding this exact same routine with the api…

Moreover, the most time spent is in cublasSger which does A = A +alphaxy^t. The GetVector is getting only one element per iteration and is almost nothing…

And over all, is this algorithm optimal (between non parallel algorithms) ?
I checked again and 3.5 seconds was matlab’s time for a 20482048 matrix. For 30723072 matrices, it takes 11 seconds which is still faster than my implementation.

Thank you again

i am also trying to implement a QR factorization using CUDA. I have a working version based on blas routines but I think in the case of CUDA it would be much better to use fast givens rotations instead of householder transformations as the memory access patterns are better in the former. my only problem now is to determine the right givens ordering for optimal behavior.

what do you think? am i on the right track ? is it the way to go ?

thank you in advance.

I’ve checked the paper and the code of “LU-GPU: Algorithms for Dense Linear Systems on Graphics Hardware”
( paper at
code at : )
They use the same algo as I do but just use rasterizations instead of cublas. Could someone confirm me that it should result in exactly the same result in term of speed ? They claim to outperform CPU based algo…

For QR, Givens is usually slower than Householder algo. Housholder might be easier to completely parallelize, but if you want to use cublas, Givens rotations are already computed by cublas… You can also have a look at gram schmidt methods for that, but I don’t know how good they are.

I am not really sure about the givens rotations, because when I first implemented my QR decomposition with givens rotations it needed longer than the implementation with householders. (My first implementations were awful due to the device / host transfers. ) The routines for applying givens rotations are level - 1 blas and they are only efficient when you have very large vectors. Without any doubt, you can definetly outperform the CPU, but as I said, you need veeeeery large vectors. Just measure the time of the cublasSrot() against a non-built in function in matlab for instance. You will see that the speed up is above one after your vector size is larger than 5 * 10^4 approximately ( I am not sure about the number but it must be around x * 10^4, x \in [1:10] ) . If you use Set and Get Methods, transferring data from device to host memory or vice versa within your loops (even if its only a single element), the vector size must be even greater. That is why I advice you not to use the cublas routines if you have a “small” amount of data to work on.

However, you should try to implement it in both ways and compare the runtimes. Depending on your data size, you might need to implement your algorithms with the runtime api, to keep your data within the shared memory. If you have transfers (device <-> host) within a loop, I think you do not have any other choice.


A friend who worked much in LU parallelization told me to use blocked LU factorizations.
That’s what I finally did, and it is muuuuuuch faster !

With block sizes of 64, I get the following results/speedups compared to matlab lu() code… :

The GPU version begins to worth it for matrix sizes over nearly 700*700 (on my computer…).


Did u work with cublas ? and which levels ?

Did u have set and get methods within loops ?

Also there is a possibility to work block wise in the qr factorization !

thx for the results !

I work with cublas only (no pure cuda code). It reuse the code I posted above with theses level 1 and 2 calls, and it does the blocked version using level 3 blas calls (solving matrix systems and sgemm) like the lapack routine.

I have a really few gets (one get of one float per column).

The problem I had is that there is no DSWAP in cublas, only SSWAP, so I had to do a loop over rows which might not be optimal.

The blocked routine is mainly inspired by DGETF2.

My friend told me blocked QR were much less easy… I’ll take a look on that, but I’ll do cholesky before…

I also have another problem : I cannot use matrices greater than 4096² which is 64Mb whereas I have a GeForce 8800GTS… why the hell ? :s

r u using your grafic card as primary display device, if so, switch to your on baord grafic card (!)

Okay I will try my qr factorization with blocked qr algorithms. For me now, I am working with really small matrices (128 x 128), thus any get set would kill the performance

thnks for the hint!

Maybe u said it before : speed up only because of the blocking ?


I don’t understand… my 8800 cannot cope with a 16001200 screen and a 40964096 texture ? :s

YES ! :)

with non blocked version, there is no hope to get any improvement compared to matlab [edit : which does use a blocked version) (although I only tested until 3072*3072 matrices).

Read the CUDA_Release_Notes_1.0.txt in your CUDA folder, I think that this problem is mentioned in there.

nice, this is also what i found out: cublasSgemm provides a speed - up above one for matrix dimensions greater than 500 and plus.

Really nice ! :))

mmm… I didn’t find the file with this exact name, but the other release notes related text files don’t seem to tell about that…

…and… I feel ashamed but… I don’t know how I can switch to my onboard graphic card… I took a look at the nvidia panel and my display configuration but didn’t find anything…

okay, when you navigate to display properties and then to the tab settings you should be able to change your settings.

it is definetly not under the nvidia panel.

I found how to deactivate hardware acceleration, but when I do that, CUDA initialization fails…

(I finally found the release note, but it just says about maximum time of a single program… for a 5000*5000 matrix, it should be far less than 5 seconds).

You don not have to deactivate your hardware I guess, but just put your nvidia graphic card not as the primary display device!

This is enough i guess.

mmm… I still didn’t find how to do that :s anyway…

This is how i changed it, but i am not sure if it is the right way

You are able to do that because you have 2 screens, but I have only one.

I guess in the release note they assume we have 2 NVidia graphic cards, one for the screen, one for cuda…

I still have my 7800GTX… maybe I should reinstall it ?

I’ve already wrote one version of parralel lu decomposition. I use starightway global memoty acess and I’ve got 350Mflops on my 8600gt due to unoptimized memory read. Now I’m woking on improved version.