# Wrong output for QR factorisation using CULA and MKL

Hi,

I would like to implement a QR factorization on the GPU.
First, I need a CPU reference for benchmarks, so I'm trying to use the sgeqrf routine from MKL. The code runs, but it gives me the wrong result compared with Matlab.

Second, I tried the sgeqrf routine from CULA Basic, and I have the same problem.

Could someone take a look at my code using MKL?

```c
#include <mkl.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i, j;
    int M = 6; // number of rows
    int N = 6; // number of columns

    // memory allocation
    float *A_h = (float *)malloc(M * N * sizeof(float));
    float *tau = (float *)malloc(N * sizeof(float));

    // initialisation of random matrix A
    printf("Initialisation of A\n");
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < N; j++)
        {
            A_h[i * N + j] = rand() % 10;
        }
    }

    // write the matrix to A.txt
    FILE *A = fopen("A.txt", "w");
    if (A)
    {
        for (i = 0; i < M; i++)
        {
            for (j = 0; j < N; j++)
            {
                fprintf(A, "%f\t", A_h[i * N + j]);
            }
            fputc('\n', A);
        }
        fclose(A);
    }

    int lwork = N; // the minimum allowed for sgeqrf
    float *work = (float *)malloc(lwork * sizeof(float));
    int info = 0;

    printf("execution of sgeqrf\n");

    sgeqrf(&M, &N, A_h, &M, tau, work, &lwork, &info);

    // write the output to out_Mkl.txt
    FILE *out = fopen("out_Mkl.txt", "w");
    if (out)
    {
        for (i = 0; i < M; i++)
        {
            for (j = 0; j < N; j++)
            {
                fprintf(out, "%f\t", A_h[i * N + j]);
            }
            fputc('\n', out);
        }
        fclose(out);
    }

    free(A_h);
    free(tau);
    free(work);
    return 0;
}
```

Makefile :

```make
CC=icc
CFLAGS=-O3

build64:
	$(CC) -o qrMkl qrMkl.c $(CFLAGS) $(LIBS)
```

It’s a very simple case with M=N=6 …

Probably just column-major versus row-major ordering between what you are passing to sgeqrf and the way you are writing the data to file for Matlab. But I am a little curious as to what this has to do with CUDA?

I am trying to write code in CUDA, but I would like to compare its output against a correct CPU version using MKL.

It seems to be a difficult task to implement a QR factorization entirely on the GPU with CUDA.

So my idea is to use sgeqrf from CULA or MAGMA to compute the QR factorization of large matrices (M=100000, N=10000), using the block Householder algorithm.

You are right, there aren't any CUDA kernels in this idea…

Thanks.

The concept of storage ordering is well explained here. BLAS and LAPACK use column-major storage conventions (inherited from Fortran), while your code writes the arrays to file in row-major order. I am guessing that if you change the code which writes the arrays to file to this:

```c
for (i = 0; i < M; i++)
{
    for (j = 0; j < N; j++)
    {
        fprintf(out, "%f\t", A_h[i + j * M]); // column-major: lda = M
    }
    fputc('\n', out);
}
```

you might get better agreement with Matlab.

The code works now, thank you very much.

I had the same problem with sgemm from CUBLAS; I solved it once but had forgotten about it! It's OK now, thank you.

So I can use CULA or MAGMA and compare results between MKL and CULA/MAGMA for large matrices…

I've just modified the code to call sgeqrf with the transpose of A on input, and to write the result as you suggested, and it's OK.

Thanks.