LU, QR and Cholesky factorizations using GPU

The LU factorization runs, but the numerical results are wrong for me. If gpu_sgetrf works like netlib's sgetrf, we obtain A = P*L*U. But how can I compute the matrix P?

I have these inputs with a random 4x4 matrix:

Device: GeForce 9600M GT, 1250 MHz clock, 511 MB memory.

Matrix A:

3.00	    6.00	    7.00	    5.00	
3.00	    5.00	    6.00	    2.00	
9.00	    1.00	    2.00	    7.00	
0.00	    9.00	    3.00	    6.00	

Matrix B:

1.00	
3.00	
1.00	
2.00	

Start sgetrf
End sgetrf successfully

Matrix PLU:

7.00	    0.71	    0.43	    0.86	
2.00	    4.57	   -0.19	    0.94	
9.00	    0.57	   -1.75	    0.14	
3.00	    3.86	    4.44	   -6.82	

ipiv : 3 4 3 4

Matrix L:

1.00	    0.00	    0.00	    0.00	
2.00	    1.00	    0.00	    0.00	
9.00	    0.57	    1.00	    0.00	
3.00	    3.86	    4.44	    1.00	

Matrix U:

7.00	    0.71	    0.43	    0.86	
0.00	    4.57	   -0.19	    0.94	
0.00	    0.00	   -1.75	    0.14	
0.00	    0.00	    0.00	   -6.82	

Matrix L*U:

 7.00     0.71     0.43     0.86
14.00     6.00     0.67     2.65
63.00     9.04     2.00     8.39
21.00    19.78    -7.20     0.00

Matrix X:

 1.05
 0.48
-0.77
-0.12

n    Gflop/s    time LU     info   time solver

4    0.000350   0.000122s   0      0.000008s

Success exit

And the vector X is totally wrong when I feed the output matrix A from gpu_sgetrf to sgetrs: sgetrs('N',n,1,A,lda,ipiv,B,n,&info);

Thanks for your answers!

PS: I've done a benchmark of the sgetrf subroutine on my CPU (3.06 GHz Intel Core 2 Duo) and my GeForce 9600M. I am going to do it again on an NVIDIA GTX 295 at work!
benchmark_LU_Decomposition_Coppey.pdf (356 KB)

I think you will find they are correct; you just don't understand how to use them.

No, in LAPACK's sgetrf you get L and U packed into the upper and lower triangular halves of A, and P (or the equivalent index vector which will effect the permutation) needs to be formed from ipiv.

For your example, the following Python shows how it can be done:

import numpy as np
import scipy.linalg as linalg

A = np.array([ [3.00, 6.00, 7.00, 5.00],
               [3.00, 5.00, 6.00, 2.00],
               [9.00, 1.00, 2.00, 7.00],
               [0.00, 9.00, 3.00, 6.00] ])

# LU holds L and U packed into one matrix; pivot is the (0-based) ipiv
LU, pivot = linalg.lu_factor(A)

# replay the row swaps recorded in pivot to build the permutation vector
N = len(pivot)
perm = np.arange(0, N)
for i in range(0, N):
    temp = perm[i]
    perm[i] = perm[pivot[i]]
    perm[pivot[i]] = temp

# form P by permuting the rows of the identity
I = np.eye(N, N)
P = np.empty((N, N))
for i in range(0, N):
    P[i, :] = I[perm[i], :]

# unpack the unit lower triangular L and the upper triangular U
L = np.tril(LU, -1) + I
U = np.triu(LU)

print A
print LU, pivot
print P
print A - np.dot(P, np.dot(L, U))  # should be ~zero, since A = P*L*U

which, when run, produces this:

[user@localhost ~]$ python lu.py

[[ 3.  6.  7.  5.]
 [ 3.  5.  6.  2.]
 [ 9.  1.  2.  7.]
 [ 0.  9.  3.  6.]]
[[ 9.          1.          2.          7.        ]
 [ 0.          9.          3.          6.        ]
 [ 0.33333333  0.62962963  4.44444444 -1.11111111]
 [ 0.33333333  0.51851852  0.85       -2.5       ]] [2 3 2 3]
[[ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]]
[[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   6.66133815e-16]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]]

Thank you avidday for all your detailed answers!

I computed the error (the gap) between the true right-hand side B of the linear system AX = B and the product A*X, where X is the solution returned by sgetrf => sgetrs:

[codebox]
// form AX = A_ref * X (the solution X is returned in B by sgetrs)
for (int i = 0; i < n; i++)
{
	for (int j = 0; j < nrhs; j++)
	{
		AX(i,j) = 0;  // make sure the accumulator starts at zero
		for (int x = 0; x < n; x++)
		{
			AX(i,j) += A_ref(i,x)*B(x,j);
		}
	}
}

// accumulate the absolute deviation from the original right-hand side
double error = 0;
for (int i = 0; i < n; i++)
{
	for (int j = 0; j < nrhs; j++)
	{
		error += fabs(B_ref(i,j)-AX(i,j));
	}
}
[/codebox]

We can see that the errors are not uniform and are quite large, as shown below:

n      Gflop/s     time LU     info   time solver   error

8000   38.355585   8.899182s   0      0.053062s     1895.537863493
8000   38.311660   8.909385s   0      0.053999s      282.375109434
8000   38.325744   8.906111s   0      0.053915s      597.245272934
8000   38.332523   8.904536s   0      0.053764s     1023.778610349
8000   38.232213   8.927899s   0      0.054831s      722.841670394

I know that the parameters are single precision (float); double precision is not supported by this GPU library.

What do you think about these errors?

Thanks

I think it is more useful to compute the absolute maximum error (i.e. the L-infinity norm) of the deviation between the solutions. Summing the errors like that doesn't really tell you very much (or even whether the operation really is working). That error of 1895 sounds like a lot, but it is only an average deviation of 1895/8000/8000 = 2.9e-5, which is very reasonable for single precision. But from the sum of the deviations it is impossible to know what the largest deviation is, which is probably far more important.
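
For example, a minimal sketch of that suggestion, reusing the AX/B_ref macros and loop bounds from the codebox above:

[codebox]
// track the largest absolute deviation (the L-infinity norm of B_ref - AX)
// instead of summing all deviations
double max_error = 0;
for (int i = 0; i < n; i++)
{
	for (int j = 0; j < nrhs; j++)
	{
		double e = fabs(B_ref(i,j) - AX(i,j));
		if (e > max_error)
			max_error = e;
	}
}
[/codebox]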

Hi, I have found the MAGMA GPU library, which I think integrates Vasily's work!

Does anyone use it? I asked some questions on the MAGMA forum but nobody has answered me! My post on Magma forum

I would like to know why the MAGMA subroutine sgetrs for solving a linear system is slower than the CPU versions (LAPACK, ACML).

Because I have to solve a linear system with N*3 unknowns and an N*N coefficient matrix, and I have to repeat this solve N*O times!
For N = 1000 and O = 500, that is 500 000 solves; at an average of 16 ms per solve, that takes 8000 seconds (2h13).

That is why I need a very fast linear solver.

PS:

I have made a benchmark of the MAGMA subroutine sgetrs solving a linear system after an LU factorization (sgetrf). I used this hardware:
an NVIDIA GeForce 9600M GPU with 16 cores, 1250 MHz, 512 MB DRAM, and an Intel Core 2 Duo CPU at 3.06 GHz with 4 GB DDR3.

The results are below (times are in milliseconds):
[benchmark chart attachment]

Vasily, do you know if any work has been done on LU factorization of sparse matrices such as band-diagonal matrices?

What was the largest matrix you have LU-factorized?

Thanks

Julien

No idea, I don’t really follow recent work on the subject.

Vasily

Volkov, would you please provide me with the execution times of Cholesky factorization on Core2 Quad and/or GTX280. I am working on my own version of Cholesky factorization and I need them for comparison purposes.

Thanks,
Asma


The graph at the top of this thread shows Gflop/s rates for Cholesky factorization on these platforms. You can convert it to execution time using the formula Time = Rate*(N*N*N/3), where N is the dimension of the matrix.

Vasily


Thanks Vasily, but shouldn't the "rate" be in the denominator?

I also tried the formula that you used to produce the execution times that frea requested, "Gflop/s rate = 4e-9*n*n*n/3/seconds", but I failed to reproduce the numbers you provided. I don't know what I am doing wrong.

Asma


Oops, you are right, it should be in the denominator.

OK, let's look at the QR numbers that I provided for frea. Take n=13000 for example. It runs in 16.3 seconds. We then get 4e-9*13000*13000*13000/3/16.3 = 180 Gflop/s. I think this is very close to the number that the picture shows. (Red dashed line.)

Vasily
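
To make the arithmetic concrete, here is a minimal C sketch of the corrected conversion; the 4e-9*n^3/3 flop count and the n = 13000, 16.3 s sample values are taken from the posts above:

[codebox]
#include <stdio.h>

int main(void)
{
	double n = 13000.0;    /* matrix dimension from the example above */
	double seconds = 16.3; /* measured QR factorization time */

	/* Gflop/s rate = 4e-9 * n^3 / 3 / seconds */
	double rate = 4e-9 * n * n * n / 3.0 / seconds;

	/* inverting, with the rate in the denominator:
	   seconds = 4e-9 * n^3 / 3 / rate */
	double time = 4e-9 * n * n * n / 3.0 / rate;

	printf("rate = %.1f Gflop/s, time = %.1f s\n", rate, time);
	return 0;
}
[/codebox]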


Thanks Vasily, I got it…

Asma


Hi, I have some problems with LU decomposition in CULA Basic; in my opinion it computes wrong results, or my implementation is bad. Could someone help me find the error?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cula.h>

#define N 3  /* matrix dimension: A is filled with 9 entries below */

/* minimal error helpers, in the style of the CULA example code */
void checkStatus(culaStatus status)
{
	if (status != culaNoError)
	{
		printf("CULA error %d\n", (int)status);
		culaShutdown();
		exit(EXIT_FAILURE);
	}
}

void checkCudaError(cudaError_t err)
{
	if (err != cudaSuccess)
	{
		printf("CUDA error: %s\n", cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}
}

int main()
{
	cudaError_t err;
	culaStatus status;

	/* pointers to host memory; TAU holds the pivot indices (ipiv) of sgetrf */
	culaFloat* A = NULL;
	culaInt* TAU = NULL;
	int i;
	/* pointers to device memory */
	culaFloat* Ad = NULL;
	culaInt* TAUd = NULL;

	printf("Allocating Matrices\n");
	A = (culaFloat*)malloc(N*N*sizeof(A[0]));
	TAU = (culaInt*)malloc(N*sizeof(TAU[0]));
	if(!A || !TAU)
		exit(EXIT_FAILURE);

	memset(A, 0, N*N*sizeof(A[0]));
	memset(TAU, 0, N*sizeof(TAU[0]));
	/* fill A; note that CULA, like LAPACK, uses column-major storage,
	   so A[0..2] is the first column, A[3..5] the second, and so on */
	A[0]=3; A[1]=3; A[2]=0;
	A[3]=0; A[4]=2; A[5]=2;
	A[6]=1; A[7]=0; A[8]=1;

	/* this linear dump prints one column per line, i.e. the transpose
	   of the matrix as written on paper */
	printf("\n\n");
	for ( i=0; i<N*N; ++i )
	{
		printf("%f, ", A[i]);
		if ( (i+1) % N == 0 )
			printf("\n");
	}

	/* Allocate device memory for the matrices */
	err = cudaMalloc((void**)&Ad, N*N*sizeof(Ad[0]));
	checkCudaError(err);

	err = cudaMalloc((void**)&TAUd, N*sizeof(TAUd[0]));
	checkCudaError(err);

	printf("Initializing CULA\n");
	status = culaInitialize();
	checkStatus(status);

	err = cudaMemcpy(Ad, A, N*N*sizeof(Ad[0]), cudaMemcpyHostToDevice);
	checkCudaError(err);
	err = cudaMemcpy(TAUd, TAU, N*sizeof(TAUd[0]), cudaMemcpyHostToDevice);
	checkCudaError(err);

	printf("Calling culaDeviceSgetrf\n");
	status = culaDeviceSgetrf(N, N, Ad, N, TAUd);
	checkStatus(status);

	/* copy back the packed LU factors (L below the diagonal, U on and above) */
	err = cudaMemcpy(A, Ad, N*N*sizeof(Ad[0]), cudaMemcpyDeviceToHost);
	checkCudaError(err);
	printf("\n\n");
	for ( i=0; i<N*N; ++i )
	{
		printf("%f, ", A[i]);
		if ( (i+1) % N == 0 )
			printf("\n");
	}

	/* copy back the pivot indices */
	err = cudaMemcpy(TAU, TAUd, N*sizeof(TAUd[0]), cudaMemcpyDeviceToHost);
	checkCudaError(err);

	printf("\n\n");
	for ( i=0; i<N; ++i )
	{
		printf("%d, ", TAU[i]);
		if ( (i+1) % N == 0 )
			printf("\n");
	}
	printf("Shutting down CULA\n");
	culaShutdown();
	free(A);
	free(TAU);
	cudaFree(Ad);
	cudaFree(TAUd);
	getchar();
	return EXIT_SUCCESS;
}
results.jpg