Wrong value returned on cublas LU factorization

Hello,
I am currently implementing an algorithm that solves a linear system.
I decided to use the cublas library for it has most of the functions i need. To get used to the library I decided to do a spike solution that would do the factorization of a single static matrix.
At first I did some samples with 3x3 matrices and the results were as expected, then i tried to expand the sample to 10x10 matrices, in my understanding the process should be the same with the exception of a bigger matrice and bigger copies to and from device.

To my surprise the sample with 10x10 values does not return the correct values, with most of them being nan.

The code is as follows:

int main(int argc, char** argv){
	
	int i, j;
				  
	double arrA[10][10] = {
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 5.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0},
							{21.0,21.0,21.0, 21.0, 21.0, 21.0, 21.0,21.0, 21.0, 21.0}
						  };
	
	double *arrADev, *arrBDev, *resultsVec;
	double **matrixArray;
	int *pivotArray;
	int *infoArray;
	double flat[100] = {0};
	int info[10] = {-1};
	int pivot[10] = {-1};
	cublasHandle_t cublasHandle;
	
	
	double *matrices[2];
	
	HANDLE_ERROR(cudaMalloc(&arrADev,  sizeof(double) * 100));
	HANDLE_ERROR(cudaMalloc(&arrBDev,  sizeof(double) * 100));
	HANDLE_ERROR(cudaMalloc(&resultsVec,  sizeof(double) * 10));
	HANDLE_ERROR(cudaMalloc(&matrixArray,  sizeof(double*) * 2));
	HANDLE_ERROR(cudaMalloc(&pivotArray,  sizeof(int) * 100));
	HANDLE_ERROR(cudaMalloc(&infoArray,  sizeof(int) * 100));
	cublasCreate(&cublasHandle);
	
	int matrixSize = 10;
	
	//maps matrix to flat vector
	for(i=0; i<matrixSize; i++){
		for(j=0; j<matrixSize; j++){
			flat[i+j*matrixSize] = arrA[i][j];
		}
	}
	
	//copy matrix A to device
	HANDLE_CUBLAS(cublasSetMatrix(matrixSize, matrixSize, sizeof(double), flat, matrixSize, arrADev, matrixSize));

	//save matrix address 
	matrices[0] = arrADev;
	
	//copy matrices references to device
	HANDLE_ERROR(cudaMemcpy(matrixArray,matrices, sizeof(double*)*1, cudaMemcpyHostToDevice));

	//LU factorization
	HANDLE_CUBLAS(cublasDgetrfBatched(cublasHandle, matrixSize, matrixArray, matrixSize, pivotArray, infoArray, 1));   
	
	//get info array
	HANDLE_CUBLAS(cublasGetVector(1, sizeof(int), infoArray, 1, info, 1));
	
	//get pivot array
	HANDLE_CUBLAS(cublasGetVector(matrixSize, sizeof(int), pivotArray, 1, pivot, 1));
	
	//print info array
	printf("Info Array:\n{");
	for(i=0; i<1; i++){
		printf(" %d", info[i]);
		if(i < 0)
			printf(",");
		else
			printf("}\n");
	}
	
	//print pivot array
	printf("Pivot Array:\n{");
	for(i=0; i<matrixSize; i++){
		printf(" %d", pivot[i]);
		if(i < matrixSize-1)
			printf(",");
		else
			printf("}\n");
	}
	
	//get LU matrix
	HANDLE_CUBLAS(cublasGetMatrix(matrixSize, matrixSize, sizeof(double), arrADev, matrixSize, flat, matrixSize));
	
	//print LU matrix
	printf("Matrix A\n");
	for(i=0; i<matrixSize; i++){
		for(j=0; j<matrixSize; j++){
			printf(" %12.1f", flat[i+j*matrixSize]);
		}
		printf("\n");
	}

	return 0;
}

All the values from the matrix are the same with the exception of a single value.
To my knowledge the result should be:
21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 -16.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

the results from the cublas function are:
21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0 21.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
1.0 -nan -nan -nan -nan -nan -nan -nan -nan -nan

I do see if the function have been successfull either with the status code and with the array info.
All the copies are being mapped from row-major to column-major and vice-versa.
Any help would be appreciated.

When I do lu decomposition in matlab on your arrA matrix, I get the same results as I get with your code running on CUDA 6. If you’re not using CUDA 6, please update to CUDA 6. And please compare against a known good source such as matlab. The “expected results” you list are not correct. The first row, for example, should be entirely 2182528. The remaining elements in the first column should be 1. The element at (6,6) should be -2182523. All other elements should be zero. Again, I get these results both in matlab and using your code with CUDA 6.

I guess you’ve edited your code now to a different arrA matrix. With the arrA matrix you show now, I get almost the same result as your expected result with CUDA 6. The value of -16 should be in the sixth row and column, not the fifth row and column as you show it. Please re-try with CUDA 6.

Yes, you are right.
I’m sorry for the confusion, this last matrix is the one I intended to use in the example.
I currently havent updated my CUDA framework to version 6, will try and post the results ASAP.