LDM^T Dense Solver on GPU Intermediate Results... NEED HELP!


I just ventured into Solver acceleration. I have implemented the LDM^T factorizer in GPU (only the factorization).
I understand the importance of factorization and the algorithm that goes bhind it. And, thats about it. No practical application experience.

The sequential algorithm for LDM^T can be found in “The Matrix computations” book by Van Loan & Golub
Page 137, Algorithm 4.1.1

I am getting pretty good speedups (8x for 1000x1000 (~340 ms) to 133x for 5000x5000 (13 secs) on a TESLA C1060 compared to 2.41GHz AMD Athlon)

But my problem is precision. The diagonal elements get pumped up pretty soon and they cumulatively amplify small errors into big ones - well, Thats my reasoning (I hope it is not a software bug - I did some verification - see next line).

I checked my application with a minimal 20x20 matrix and the max error was 2.xxx… The original matrix contained random elements from 1 to 100. I dumped the entire matrix just before calculating the max-error element and found that entires were pretty accurate… except that the big diagonal elements are amplifying miniscule errors for subsequent calculations. I actually wrote a C program to validate this. Please find it in attachment. The “par” array is the GPU based array just before calculating the [17][17] diagonal and “desi” array is the CPU based array just before doing the [17][17] one. “par” and “desi” array values are very close to each other and yet [17][17] calculation differs by 2. I also print the diagonal elements out.

So, my questions are:

  1. What kind of pre-conditioners need to be applied before doing a LDM^T?
    At the momment I am randomly generating a matrix with entries between 1 and 2. (1 to 100 for the test application in the attachment)

  2. Are there huge matrices available in the internet for benchmarking? I remember some1 posting something about it sometime back.

  3. I think all solvers need to address this floating point precision problem. So, What is the industry standard for it?
    Is that why Iterative solvers are in place?

  4. By question 3, Are most direct solvers accompanied by an iterative companion to find right values?

Thanks in Advance,

Best Regards,
test.cpp (9.73 KB)

Factorizations like the one you are playing with don’t need preconditioners but you have to ensure that your matrix meets the mathematical prerequisites, ie. non-singular and symmetric. I am going to guess that full dense matrices of random numbers don’t qualify.

Not for dense problems. There are libraries of “classic” sparse matrices for benchmarking purposes (the Boeing-Rutherford library is probably well known). In matrix algebra, algorithm design and performance testing almost always revolves around structure and condition number.

I am not convinced you have a precision problem.

Not for direct solvers. Usually the onus is on the user to determine the mathematical properties of the problem and then choose an appropriate solver. The structure of Lapack and Linpack (and hence just about every other linear algebra library) use that paradigm.

I think you meant “non-singular and square”… LDL^T is the thing for Symmetric matrices.

Thanks for these useful pointers.

Hmmm… Thats sad. :-(

Even it is difficult for me to get convinced… Did u see the attachment in the first post ?

That makes lot of sense. Thanks a LOT!

Yes I looked at it, and I calculated the determinant for your two matrices, which are nowhere close to singular. I don’t have access to a single precision factorization now, but I was able to factorize your 20x20 desi in double precision with an L2 norm of about 2D-14, so it is certainly well posed.

All I can suggest is to use one of the correspondin LAPACK routines as a “gold” version and see how your implementation fares.

If you compute this in (single|double) precision, use (s|d) LAPACK as “gold” (despite me hating this terminology in the SDK lingo). The expected error due to floating point not being associative (and cancellation amplyfying roundoff and blablabla) is in the noise for any reasonable test example.

If single precision is not accurate enough, iteratively refine.

13 seconds for 5000x5000 problem that you get on C1060 implies 6.4 Gflop/s. Even ancient Athlon must be able to do this. On a modern CPU you can get 10x this number. I suppose this 133x speedup is compared to your own code, not ACML.

If you don’t pivot such as in the attached code, the error may be arbitrarily large.

I could not reproduce that. I see values “-3790.159180 -3787.652344”.

Many direct solvers such as in LAPACK are accompanied by iterative refinement to improve the accuracy.



Thanks for taking your time to look at the code. Thanks!
The Matrix book says that all the leading principal sub-matrices must be non-singular… Not just the matrix…
So, I believe there must be some kind of pivoting that goes in…
And as you say, a matrix full of random numbers is just garbage, I think.

Dominic, Avidday

I think LAPACK does not offer LDM^T. Even the matrix computation book uses it as an intermdiate result for proving LDL^T.
LDL^T offered by LAPACK had pivoting and lot of things going in. It does not have one-to-one correpondence to the algorithm that I mentioned in the book. So, I think I have to read up more.
Difficult for a novice like me to figure out Fortran and LAPACK… Anyway, I am gonna try and figure out what it all means.


Thanks for responding. Yeah, 6 GFLOPS really sucks… That one includes memcopies and some CPU calculations as well…
The exact GPU time adds up to 7 seconds… So that roughly comes to 11.9 GFlops… Still bad though.

And, THANKS a lot for pointing out a weak CPU implementation. I have just downloaded ACML (thanks for pointing to it once again).
I will use ACML to evaluate LL^T for 5000 and see how that goes. I think LL^T has 1/2 the complexity of LDM^T. So atleast that will gimme an indication.

You have reproduced th result correctly. “-3790.159180 -3787.652344” - The absolute difference between them is around 2… which is a big error for 20x20 matrix… At this point, I think the initial random matrix is a bad choice. I need to go figure what is a good candidate before testing all these.

Thanks for all your answers. Hopefully, I will get back soon with something more meaningful.


Best Regards,

Kindly be patient while I share my observations. I have some questions on them at the end.

I have an update. The 133x speedup was against my own implementation that was done to reflect the parallel-order of floating arithmetic execution.

So, I benchmarked the code against the bookish algorithm. And the speedup comes ~ to 95x.

Find below the bookish code:

The Matrix is passed as REAL** Fortran Array. But inside – it is one big linear array (starting from A[1] - Fortran Indexing)

In the code below:

  1. NUM_SQR_F(m,n,i,j) basically finds A[i,j]

  2. Row Major Order of storing elements

  3. The Array “REAL**A” works for FORTRAN type indices (1,1) to (n,n)

#define NUM_SQRM_F(m,n,i,j) m[(i-1)*(n+1) + j]

int LDMTFactor(REAL **A, int n)


	int i,j,k;

				REAL *m = A[1];

	REAL *v;

	v = (REAL *)malloc((n+1) * sizeof(REAL));

	for(j=1; j<=n; j++)




			v[i] = NUM_SQRM_F(m,n,i,j);




			for(int l=k+1; l<=j; l++)


				v[l] = v[l] - v[k]*NUM_SQRM_F(m,n,l,k);





			NUM_SQRM_F(m,n,i,j) = v[i]/NUM_SQRM_F(m,n,i, i);


		NUM_SQRM_F(m,n,j,j) = v[j];





				NUM_SQRM_F(m,n,i,j) = NUM_SQRM_F(m,n,i,j) - v[k]*NUM_SQRM_F(m,n,i,k); 






			NUM_SQRM_F(m,n,i,j) = NUM_SQRM_F(m,n,i,j)/v[j];




	return 0;


This code takes 1306 seconds to LDM^T factorize a 5000x5000 matrix. (malloc and free are included in timing - but I dont think that would take much time)

Let us keep aside the fact that the matrix I passed did not have correct values. It does not matter for the CPU.

As Vasily says – modern CPUs can be 10 times faster than the GPU output (13.5 seconds) by using special math libraries… . That means ACML or Intel MKL should be able to do this is 1 or 2 seconds… And, that would mean 650x performance increase…

What kind of special things that these libraries do to get such monstruous performance…?

As far as I know, CPU code can be optimized by

  1. SSE - vector math - 4 SP floats can be multiplied at the same time compared to scalar(1)

  2. Loop unrolling

3 Branch prediction (Static prediction always favours a backward jump… so loops should technically be fine)

  1. Cache optimization (Prefetching and making sure that caches are fully utilized by the computation) - How big is the sum of L1+L2+L3?? Any idea?

Can some1 leave their ideas?

I will be profiling ACML for LL^T tomorrow. (LL^T may have cache advantages because columns and rows will be the same… That might realllly help).

May be, storing the matrix in a particular way may actually help reap the cache advantage. but then there is a limit to the cache size anyway.

Any ideas?


Best Regards,


You may find many answers to your questions in books and tutorials on high-performance computing, such as “Numerical Linear Algebra for High-Performance Computers” by Dongarra, Duff, Sorensen and van der Vorst and “LAPACK Working Note #28: The IBM RISC System/6000 and Linear Algebra Operations” by Dongarra, Mayes, and di Brozolo.

Thank you Vasily… You are right. Not having a strong theoretical foundation is hurting me now… How I wish, I could go back to schoool…

Thanks for the links!

I am at least glad that I got the right-looking concept right on my own and I exploited that parallelism in the CUDA implementation. I need to see how block factoring can make such a huge difference in CPU side… Lets c. Thanks!