Wrong output for QR factorisation using CULA and MKL

Hi,

I would like to implement a QR factorization on the GPU.
First, I need a CPU reference implementation for benchmarking, so I'm trying to use the sgeqrf subroutine from MKL.
The code runs, but it gives me the wrong result compared with Matlab.

Second, I tried the sgeqrf subroutine from CULA Basic and I have the same problem.

Could someone take a look at my code using MKL:

#include <mkl.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
int i,j;

int M=6; //number of rows
int N=6; //number of columns

//memory allocation
float* A_h=(float*)malloc(M*N*sizeof(float));
float* tau=(float*)malloc(N*sizeof(float));

//Initialisation of random matrix A
printf("Initialisation of A\n");
for (i=0;i<M;i++)
{
	for (j=0;j<N;j++)
	{
		A_h[i*N+j]=rand() % 10; //NB: row-major fill, element (i,j) at i*N+j
	}
}

//Write the matrix in A.txt
FILE* A=fopen("A.txt","w");
if (A)
{
	for (i=0;i<M;i++)
	{
		for (j=0;j<N;j++)
		{
			fprintf(A,"%f\t",A_h[i*N+j]);
		}
		fputc('\n',A);
	}
}
fclose(A);

int lwork=N; //the minimum LAPACK allows (lwork >= max(1,N)); see the workspace-query note below
float* work=(float*)malloc(lwork*sizeof(float));
int info=0;

printf("execution of sgeqrf\n");

sgeqrf(&M, &N, A_h, &M, tau, work, &lwork, &info); //LAPACK treats A_h as column-major with leading dimension M

//Write the output in out_Mkl.txt
FILE* out=fopen("out_Mkl.txt","w");
if (out)
{
	for (i=0;i<M;i++)
	{
		for (j=0;j<N;j++)
		{
			fprintf(out,"%f\t",A_h[i*N+j]);
		}
		fputc('\n',out);
	}
}
fclose(out);

free(A_h);
free(tau);
free(work);
return 0;

}
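
One note on the workspace size: lwork=N is only the minimum LAPACK requires for sgeqrf. The usual pattern for larger matrices is a workspace query with lwork=-1, which makes sgeqrf return the optimal size in the first element of the work argument. A minimal sketch of that pattern, as a drop-in replacement for the lwork/work/info lines above (reusing M, N, A_h and tau from the listing):

//Workspace query: lwork=-1 asks sgeqrf to report the optimal
//workspace size in query, without performing the factorization
float query;
int lwork=-1;
int info=0;
sgeqrf(&M, &N, A_h, &M, tau, &query, &lwork, &info);

lwork=(int)query;
float* work=(float*)malloc(lwork*sizeof(float));
sgeqrf(&M, &N, A_h, &M, tau, work, &lwork, &info); //actual factorization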

Makefile :

CC=icc
CFLAGS=-O3
LIBS=-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread

build64:
	$(CC) -o qrMkl qrMkl.c $(CFLAGS) $(LIBS)

It’s a very simple case with M=N=6 …

Thank you for your answers.

Probably just column-major versus row-major ordering between what you are passing to sgeqrf and the way you are writing the data to file for use in Matlab. But I am a little curious as to what this has to do with CUDA?
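
To make the ordering difference concrete, here is a tiny standalone illustration (a hypothetical example, not your code): the same 2x3 matrix {{1,2,3},{4,5,6}} stored both ways, with the two indexing formulas side by side.

#include <stdio.h>

int main(void)
{
	int M=2, N=3, i, j;
	float row_major[]={1,2,3, 4,5,6};  //element (i,j) at i*N+j (C convention)
	float col_major[]={1,4, 2,5, 3,6}; //element (i,j) at i+j*M (Fortran/LAPACK convention)

	for (i=0;i<M;i++)
	{
		for (j=0;j<N;j++)
		{
			//both expressions address the same logical element (i,j)
			printf("%g=%g\t", row_major[i*N+j], col_major[i+j*M]);
		}
		printf("\n");
	}
	return 0;
}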

Thank you for your answer.

I'm trying to write code in CUDA, but I would like to compare against correct output from a CPU code using MKL.

It seems to be a difficult task to implement a QR factorization entirely on the GPU with CUDA.

So my idea is to use sgeqrf from CULA or MAGMA to compute the QR factorization of large matrices (M=100000, N=10000), using the block Householder algorithm.

You are right, there aren't any CUDA kernels of my own in this approach…

I haven't understood your answer about column-major versus row-major ordering; what do I have to change?

Thanks.

The concept of storage ordering is well explained here. BLAS and LAPACK use column-major storage conventions (inherited from Fortran). When you write your arrays out to file, you are using row-major ordering. I am guessing that if you change the code which writes the arrays to file to this:

for (i=0;i<M;i++)
{
	for (j=0;j<N;j++)
	{
		fprintf(out,"%f\t",A_h[i+j*M]); //column-major: element (i,j) at i+j*lda, where lda = M
	}
	fputc('\n',out);
}

you might get better agreement with Matlab.
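
Alternatively, you could fill A in column-major order from the start, so that the layout sgeqrf sees and the layout you write to file agree. A sketch of the changed initialisation loop, reusing the variables from your listing:

//Fill A_h directly in column-major order: element (i,j) at i+j*M,
//so no reordering is needed anywhere else
for (j=0;j<N;j++)
{
	for (i=0;i<M;i++)
	{
		A_h[i+j*M]=rand() % 10;
	}
}

The file-writing loops then use the same i+j*M indexing shown above.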

The code works now, thank you very much.

I had the same problem with sgemm from CUBLAS; I solved it back then but had forgotten about it! Now it's OK, thank you.

So I can use CULA or MAGMA and compare results between MKL and CULA/MAGMA for large matrices…

I've just modified the code to call sgeqrf with the transpose of A as input and to write the result as you suggested, and it works.

Thanks.
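
One further point if you want to compare against Matlab's [Q,R] = qr(A) directly: sgeqrf does not return Q explicitly. On exit, the upper triangle of A holds R, and the entries below the diagonal, together with tau, encode the Householder reflectors. A minimal sketch of forming Q explicitly with sorgqr after the factorization, reusing the variables from the listing above (copy R out of the upper triangle first, since sorgqr overwrites the array):

//After sgeqrf: upper triangle of A_h holds R; the part below the
//diagonal plus tau encodes the Householder reflectors.
//sorgqr overwrites A_h with the explicit M-by-N orthogonal factor Q.
int K=N; //number of elementary reflectors (K = min(M,N) here)
sorgqr(&M, &N, &K, A_h, &M, tau, work, &lwork, &info);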