Wrong value returned on cublas LU factorization

tiagomr · April 28, 2014, 5:10pm

Hello,
I am currently implementing an algorithm that solves a linear system.
I decided to use the cublas library because it has most of the functions I need. To get used to the library, I decided to write a spike solution that factorizes a single static matrix.
At first I tried some samples with 3x3 matrices and the results were as expected. Then I tried to expand the sample to 10x10 matrices; as I understand it, the process should be the same, just with a bigger matrix and bigger copies to and from the device.

To my surprise, the 10x10 sample does not return the correct values; most of them are nan.

The code is as follows:

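#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// HANDLE_ERROR and HANDLE_CUBLAS are error-checking macros whose definitions are not shown in this post

int main(int argc, char** argv){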
	
	int i, j;
				  
	double arrA[10][10] = {
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 5.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0}
						  };
	
	double *arrADev, *arrBDev, *resultsVec;
	double **matrixArray;
	int *pivotArray;
	int *infoArray;
	double flat[100] = {0};
	int info[10] = {-1};
	int pivot[10] = {-1};
	cublasHandle_t cublasHandle;
	
	
	double *matrices[2];
	
	HANDLE_ERROR(cudaMalloc(&arrADev,  sizeof(double) * 100));
	HANDLE_ERROR(cudaMalloc(&arrBDev,  sizeof(double) * 100));
	HANDLE_ERROR(cudaMalloc(&resultsVec,  sizeof(double) * 10));
	HANDLE_ERROR(cudaMalloc(&matrixArray,  sizeof(double*) * 2));
	HANDLE_ERROR(cudaMalloc(&pivotArray,  sizeof(int) * 100));
	HANDLE_ERROR(cudaMalloc(&infoArray,  sizeof(int) * 100));
	cublasCreate(&cublasHandle);
	
	int matrixSize = 10;
	
	//maps matrix to flat vector
	for(i=0; i<matrixSize; i++){
		for(j=0; j<matrixSize; j++){
			flat[i+j*matrixSize] = arrA[i][j];
		}
	}
	
	//copy matrix A to device
	HANDLE_CUBLAS(cublasSetMatrix(matrixSize, matrixSize, sizeof(double), flat, matrixSize, arrADev, matrixSize));

	//save matrix address 
	matrices[0] = arrADev;
	
	//copy matrices references to device
	HANDLE_ERROR(cudaMemcpy(matrixArray,matrices, sizeof(double*)*1, cudaMemcpyHostToDevice));

	//LU factorization
	HANDLE_CUBLAS(cublasDgetrfBatched(cublasHandle, matrixSize, matrixArray, matrixSize, pivotArray, infoArray, 1));   
	
	//get info array
	HANDLE_CUBLAS(cublasGetVector(1, sizeof(int), infoArray, 1, info, 1));
	
	//get pivot array
	HANDLE_CUBLAS(cublasGetVector(matrixSize, sizeof(int), pivotArray, 1, pivot, 1));
	
	//print info array
	printf("Info Array:\n{");
	for(i=0; i<1; i++){
		printf(" %d", info[i]);
		if(i < 0)
			printf(",");
		else
			printf("}\n");
	}
	
	//print pivot array
	printf("Pivot Array:\n{");
	for(i=0; i<matrixSize; i++){
		printf(" %d", pivot[i]);
		if(i < matrixSize-1)
			printf(",");
		else
			printf("}\n");
	}
	
	//get LU matrix
	HANDLE_CUBLAS(cublasGetMatrix(matrixSize, matrixSize, sizeof(double), arrADev, matrixSize, flat, matrixSize));
	
	//print LU matrix
	printf("Matrix A\n");
	for(i=0; i<matrixSize; i++){
		for(j=0; j<matrixSize; j++){
			printf(" %12.1f", flat[i+j*matrixSize]);
		}
		printf("\n");
	}

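	//release the cublas handle and the device buffers
	cublasDestroy(cublasHandle);
	HANDLE_ERROR(cudaFree(arrADev));
	HANDLE_ERROR(cudaFree(arrBDev));
	HANDLE_ERROR(cudaFree(resultsVec));
	HANDLE_ERROR(cudaFree(matrixArray));
	HANDLE_ERROR(cudaFree(pivotArray));
	HANDLE_ERROR(cudaFree(infoArray));

	return 0;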
}

All the values in the matrix are the same, with the exception of a single value.
To my knowledge the result should be:
21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 -16.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

The results from the cublas function are:
21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan

I do check whether the function succeeded, both through the status code and through the info array.
All the copies are being mapped from row-major to column-major and vice-versa.
Any help would be appreciated.
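
A note on the info array check mentioned above: as far as I understand the cuBLAS documentation for the getrfBatched routines, a singular input does not make the call itself fail. The call can still return a success status, and the singularity is reported per matrix through the info array. The following is a minimal host-side sketch of how one such value could be interpreted. The names `getrf_outcome`, `classify_getrf_info` and `report_getrf_info` are hypothetical and not part of cuBLAS, and the three cases assume the LAPACK-style convention.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names for this sketch; none of these are cuBLAS identifiers. */
typedef enum { GETRF_OK, GETRF_SINGULAR, GETRF_BAD_ARGUMENT } getrf_outcome;

/* Classify one info value, assuming the LAPACK-style convention:
 * 0 = success, k > 0 = U(k,k) is exactly zero, -j < 0 = bad j-th argument. */
getrf_outcome classify_getrf_info(int info)
{
    if (info == 0)
        return GETRF_OK;
    return info > 0 ? GETRF_SINGULAR : GETRF_BAD_ARGUMENT;
}

/* Print a readable line for matrix number batch_index of a batched call. */
void report_getrf_info(int batch_index, int info)
{
    assert(batch_index >= 0);
    switch (classify_getrf_info(info)) {
    case GETRF_OK:
        printf("matrix %d: factorization succeeded\n", batch_index);
        break;
    case GETRF_SINGULAR:
        printf("matrix %d: U(%d,%d) is exactly zero, matrix is singular\n",
               batch_index, info, info);
        break;
    case GETRF_BAD_ARGUMENT:
        printf("matrix %d: argument %d had an illegal value\n",
               batch_index, -info);
        break;
    }
}
```

After copying the info array back to the host, a call such as `report_getrf_info(0, info[0])` would distinguish a singular input from an invalid argument.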

Robert_Crovella · April 28, 2014, 6:36pm

When I do lu decomposition in matlab on your arrA matrix, I get the same results as I get with your code running on CUDA 6. If you’re not using CUDA 6, please update to CUDA 6. And please compare against a known good source such as matlab. The “expected results” you list are not correct. The first row, for example, should be entirely 2182528. The remaining elements in the first column should be 1. The element at (6,6) should be -2182523. All other elements should be zero. Again, I get these results both in matlab and using your code with CUDA 6.

Robert_Crovella · April 28, 2014, 7:20pm

I guess you’ve edited your code now to a different arrA matrix. With the arrA matrix you show now, I get almost the same result as your expected result with CUDA 6. The value of -16 should be in the sixth row and column, not the fifth row and column as you show it. Please re-try with CUDA 6.
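
For anyone who wants to check the expected output without MATLAB, here is a minimal host-side sketch of LU factorization with partial pivoting on a column-major matrix. It is a plain C reference in the style of LAPACK's unblocked getrf, not the algorithm cuBLAS uses internally; `lu_reference` and `IDX` are names made up for this example. A zero pivot is skipped instead of divided by, and its 1-based position is returned the way an info value would be.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Column-major element access with leading dimension lda. */
#define IDX(i, j, lda) ((size_t)(j) * (size_t)(lda) + (size_t)(i))

/*
 * Hypothetical host-side reference, not a cuBLAS routine.
 * Factorizes the n x n column-major matrix a in place: U ends up on and
 * above the diagonal, the multipliers of L below it (unit diagonal implied).
 * pivot[k] is the 1-based row that was swapped with row k+1.
 * Returns 0 on success, or the 1-based index of the first pivot that is
 * exactly zero (the matrix is then singular).
 */
int lu_reference(int n, double *a, int lda, int *pivot)
{
    int info = 0;
    assert(n > 0 && lda >= n);

    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the largest magnitude in column k. */
        int p = k;
        for (int i = k + 1; i < n; i++) {
            if (fabs(a[IDX(i, k, lda)]) > fabs(a[IDX(p, k, lda)]))
                p = i;
        }
        pivot[k] = p + 1;

        /* Zero pivot: record it and skip the division for this column. */
        if (a[IDX(p, k, lda)] == 0.0) {
            if (info == 0)
                info = k + 1;
            continue;
        }

        /* Swap rows k and p across all columns. */
        if (p != k) {
            for (int j = 0; j < n; j++) {
                double tmp = a[IDX(k, j, lda)];
                a[IDX(k, j, lda)] = a[IDX(p, j, lda)];
                a[IDX(p, j, lda)] = tmp;
            }
        }

        /* Store the multipliers below the diagonal. */
        for (int i = k + 1; i < n; i++)
            a[IDX(i, k, lda)] /= a[IDX(k, k, lda)];

        /* Update the trailing submatrix. */
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                a[IDX(i, j, lda)] -= a[IDX(i, k, lda)] * a[IDX(k, j, lda)];
    }
    return info;
}
```

Filling a 10x10 array with 21.0, setting the (6,6) element to 5.0 and calling `lu_reference(10, a, 10, pivot)` leaves 21.0 across the first row, 1.0 down the rest of the first column, -16.0 at (6,6) and zeros elsewhere, which matches the placement described above. The function returns 2 because this matrix is singular: after the first elimination step the second column has no nonzero pivot left.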

tiagomr · April 28, 2014, 9:16pm

Yes, you are right.
I’m sorry for the confusion, this last matrix is the one I intended to use in the example.
I haven't updated my CUDA toolkit to version 6 yet; I will try it and post the results ASAP.
