Cublas: column-major or row-major ?

Hi everybody,

first of all i would like to say that i’m a beginner in Cublas developpement on Linux.

I’ve read in the Cuda Cublas manual (that one) that Cublas was using column-major storage et 1-base indexing.

But when i run this double loop to calculate a matrix product between a tranpose and its matrix (At . A), everything is working well, or it should not isn’t it ?

Here is the double loop i’m talking about :

for(i=0; i < NC; ++i) {

   for(j=0; j < NC; ++j) {

     *(AiTAi+i*NC+j) = cublasDdot(BLOC, dev_Ai+i, NC, dev_Ai+j, NC);

   }

}

At the end, the content of AiTAi is right, but i’m using 0-based indexing et row-major storage.

Is there something wrong in my code ?

Thanks !

1-index is a convention of fortran. If you use C and follow 0-based index, then you don’t need to worry about this.

your A is row-major, so A(k,i) = A[kNC + i] = (A+i)[kNC], here A of A[…] is a pointer.

dev_Ai+i is starting address of column i.

dev_Ai+j is starting address of column j.

A’(i,k)=A(k,i)=(A+i)[kNC] = (dev_Ai+i)[kNC], so your setting incx = NC is correct.

Thanks for your answers.

But now if i call cublasDtrsm() with my two matrix, all the results are wrong.
Is it because i use row-major storage instead of column-major like required in the Cublas manual ?

The goal of my program is to resolve A.x = b systems with an “out of core” treatment using cublasDtrsm.

Thanks a lot.

If b is a vector, then I will suggest cublasDtrsv, you can still use row-major but need to call transpose version,

cublasDtrsv( …, trans = ‘T’, …, b, incx=1)

Reason: A is row-major, and lda = N

A(i,j) = A[ i*N + j] = A[ column-major(j,i) ]

CUBLAS uses column-major, if AG is the matrix interpreted by CUBLAS, then AG(j,i) = AG[ column-major(j,i) ] = A(i,j).

This is why you need to call transpose version.

Here is a part of my code :

if(cublasSetMatrix(NC, NC, sizeof(double), AtA, NC, dev_AtA, NC) != CUBLAS_STATUS_SUCCESS) {

	cublasShutdown();

	return 1;

}

if(cublasSetVector(NC, sizeof(double), Atb, 1, dev_Atb, 1) != CUBLAS_STATUS_SUCCESS) {

	cublasShutdown();

	return 1;

}

cublasDtrsv('U', 'T', 'N', NC, AtA, NC, Atb, 1);

if(cublasGetVector(NC, sizeof(double), dev_Atb, 1, Atb, 1) != CUBLAS_STATUS_SUCCESS) {

	getMessage(cublasGetError());

	cublasShutdown();

	return 1;

}

AtA is an NCxNC matrix and Atb is a vector of NC lines, both uses row-major.

I’m still having some troubles running this code. For an example, if i use this matrix, ((1 2)(3 4)), and this vector (5 6), cublasDtrsv() returns (24 34).

Thanks for your help.

cublasSetMatrix(NC, NC, sizeof(double), AtA, NC, dev_AtA, NC)

dev_AtA = transpose(AtA), i.e. dev_AtA(i,j) = AtA(j,i)

You have wrong setting on dtrsv, please pass device memory pointers

cublasDtrsv('U', 'T', 'N', NC, dev_AtA, NC, dev_Atb, 1);

also if ‘U’ is upper triangle of AtA, then you must specify ‘L’ because dev_AtA = transpose(AtA).

cublasDtrsv('L', 'T', 'N', NC, dev_AtA, NC, dev_Atb, 1);

‘T’ is correct because dev_AtA = transpose(AtA).

Hi,

in fact, i do a bloc by bloc loading form my file which contains my matrix A and my vector B because the matrix is too big to be saved in the CPU memory.

For each bloc Ai of the matrix A i do this :

WHILE !feof(matrix_file)

  load(Ai, Bi)

  AiTAi = transpose(Ai) . Ai

  AiTBi = transpose(Ai) . Bi

  ATA += AiTAi

  ATB += AiTBi

END

At the end of this part, AtA contains a column-major symetrical matrix and AtB is a simple vector, so i have no transpose to care about.

Now, when i try to call cublasDtrsv() like this it returns wrong values.

cublasDtrsv('L', 'N', 'N', NC, dev_AtA, NC, dev_Atb, 1);

If i try with this matrix,

1 0 0 0 0

0 1 0 0 0

0 0 1 0 0

0 0 0 1 0 

0 0 0 0 1

and this vector B,

1

2

3

4

5

the call returns the right values.

But if i try with that matrix and the same vector B, the results are wrong again …

1 0 0 0 0

0 1 0 0 0

0 0 1 0 0

0 0 0 1 0 

1 0 0 0 1

Here is my code, with french comments, sorry about that.

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <cuda.h>

#include <cuda_runtime_api.h>

#include <cublas.h>

#include "magma.h"

#include "magma_lapack.h"

#define IDX2C(i,j,ld) (((j)*(ld))+(i))

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

void sum(double * A, double * Ai, int NL, int NC) {

	int i;

	

	for(i=0; i < NL*NC; ++i) {

		*(A+i) += *(Ai+i);

	}

}

int getMessage(cublasStatus c) {

	switch(c) {

		case CUBLAS_STATUS_SUCCESS : break;

		case CUBLAS_STATUS_NOT_INITIALIZED :

			printf("Librairie CUBLAS non initialisée.\n");

		break;

		case CUBLAS_STATUS_ALLOC_FAILED :

			printf("Problème d'allocation mémoire.\n");

		break;

		case CUBLAS_STATUS_ARCH_MISMATCH :

			printf("Le périphérique utilisé ne gère pas la double précision.\n");

		break;

		case CUBLAS_STATUS_EXECUTION_FAILED :

			printf("Echec lors du l'exécution de la fonction sur le GPU.\n");

		break;

		default: printf("ERREUR NON REPERTORIEE\n"); break;

	}

}

void read_dimensions(FILE * file_a, int * NL, int * NC) {

	fscanf(file_a, "%d\n", NL);

	fscanf(file_a, "%d\n", NC);

	

}

void read_bloc(FILE * file_a, FILE * file_b, double * Ai, double * bi, int NC, int BLOC, int * lines_read) {

	int i = 0, j;

	

	/* Lecture du fichier matrice */

	while(!feof(file_a) && i < BLOC) {

		for(j=0; j < NC; ++j) {

			fscanf(file_a, "%lf ", (Ai+(i*NC)+j));

			//fscanf(file_a, "%lf ", (Ai+(j*NC)+i));

		}

		i++;

	}

	

	/* Lecture du fichier vecteur */

	i = 0;

	while(!feof(file_b) && i < BLOC) {

		fscanf(file_b, "%lf\n", (bi+i));

		i++;

	}

	

	*lines_read += i;

}

int main(int argc, char * argv[]) {

	

	if(argc != 4) {

		printf("\n\tusage: ./cublas_cholesky [bloc] [file_a] [file_b]\n\n");

		return 1;

	}

	

	/* Pointeurs GPU */

	double * dev_Ai, * dev_AitAi, * dev_Aitbi, * dev_bi, * dev_AtA, * dev_Atb;

	/* Pointeurs CPU */

	double * Ai, * bi, * AitAi, * Aitbi, * AtA, * Atb;

	int NL, NC, BLOC, i, j, lines_read = 0;

	FILE * file_a, * file_b, * res;

	

	/* Init */

	BLOC = atoi(argv[1]);

	file_a = fopen(argv[2], "r");

	file_b = fopen(argv[3], "r");

	res = fopen("res.txt", "w");

	

	/* Ouverture du fichier de la matrice */

	if(!file_a) {

		printf("erreur: impossible d'ouvrir le fichier '%s'.\n", argv[2]);

		return 1;

	}

	

	/* Ouverture du fichier du vecteur */

	if(!file_b) {

		printf("erreur: impossible d'ouvrir le fichier '%s'.\n", argv[3]);

		return 1;

	}

	

	/* Chargement des dimensions */

	read_dimensions(file_a, &NL, &NC);

	/* bloc Ai */

	if( !(Ai = (double*)malloc(NC*BLOC*sizeof(double)))) {

		printf("fail!\n");

		return 1;

	}

	

	/* bloc bi */

	if( !(bi = (double*)malloc(BLOC*sizeof(double)))) {

		printf("fail!\n");

		return 1;

	}

	

	/* Ait.Ai */

	if( !(AitAi = (double*)malloc(NC*NC*sizeof(double)))) {

		printf("fail!\n");

		return 1;

	}

	

	/* Ait.bi */

	if( !(Aitbi = (double*)malloc(NC*sizeof(double)))) {

		printf("fail!\n");

		return 1;

	}

	

	/* At.A */

	if( !(AtA = (double*)malloc(NC*NC*sizeof(double)))) {

		printf("fail!\n");

		return 1;

	}

	

	/* At.b */

	if( !(Atb = (double*)malloc(NC*sizeof(double)))) {

		printf("fail!\n");

		return 1;

	}

	

	/* Initialisation de Cublas */

	cublasInit();

	

	/* Ai en GPU */

	if(cublasAlloc(BLOC*NC, sizeof(double), (void **)&dev_Ai) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	

	/* vecteur b en GPU */

	if(cublasAlloc(BLOC, sizeof(double), (void **)&dev_bi) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	

	/* AitAi en GPU */

	if(cublasAlloc(NC*NC, sizeof(double), (void **)&dev_AitAi) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	/* Aitbi en GPU */

	if(cublasAlloc(NC, sizeof(double), (void **)&dev_Aitbi) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	

	/* At.A en GPU */

	if(cublasAlloc(NC*NC, sizeof(double), (void **)&dev_AtA) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	

	/* At.b en GPU */

	if(cublasAlloc(NC, sizeof(double), (void **)&dev_Atb) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	

	/* Initialisation à 0 de AtA et Atb */

	memset(AtA, 0, NC*NC*sizeof(double));

	memset(Atb, 0, NC*sizeof(double));

	while( (NL - lines_read) > 0 ) {

		

		/* Chargement d'un bloc de la matrice en mémoire */

		read_bloc(file_a, file_b, Ai, bi, NC, BLOC, &lines_read);

	

		/* Mise en mémoire GPU de la matrice Ai */

		if(cublasSetMatrix(BLOC, NC, sizeof(double), Ai, BLOC, dev_Ai, BLOC) != CUBLAS_STATUS_SUCCESS) {

			printf("fail!\n");

			cublasShutdown();

			return 1;

		}

		

		/* Mise en mémoire GPU du vecteur b */

		if(cublasSetVector(BLOC, sizeof(double), bi, 1, dev_bi, 1) != CUBLAS_STATUS_SUCCESS) {

			printf("fail!\n");

			cublasShutdown();

			return 1;

		}

	

		/* Calcul de (Ait . Ai) */

		for(i=0; i < NC; ++i) {

			for(j=0; j < NC; ++j) {

				*(AitAi+j*NC+i) = cublasDdot(BLOC, dev_Ai+i, NC, dev_Ai+j, NC);

			}

		}

		if(cublasGetError() != CUBLAS_STATUS_SUCCESS) {

			getMessage(cublasGetError());

			cublasShutdown();

			return 1;

		}

		

		/* Calcul de (Ait . bi) */

		for(i=0; i < NC; ++i) {

			*(Aitbi+i) = cublasDdot(BLOC, dev_Ai+i, NC, dev_bi, 1);

		}

		if(cublasGetError() != CUBLAS_STATUS_SUCCESS) {

			getMessage(cublasGetError());

			cublasShutdown();

			return 1;

		}

		

		/* Somme de Ai et bi */

		sum(AtA, AitAi, NC, NC);

		sum(Atb, Aitbi, NC, 1);

	}

	/* Désallocation CPU et GPU des matrices inutiles */

	cublasFree(dev_Ai);

	cublasFree(dev_bi);

	cublasFree(dev_AitAi);

	cublasFree(dev_Aitbi);

	

	free(Ai);

	free(bi);

	free(AitAi);

	free(Aitbi);

	

	close(file_a);

	close(file_b);

	

	/* Mise en mémoire GPU de la matrice At.A */

	if(cublasSetMatrix(NC, NC, sizeof(double), AtA, NC, dev_AtA, NC) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	

	/* Mise en mémoire GPU de la matrice At.b */

	if(cublasSetVector(NC, sizeof(double), Atb, 1, dev_Atb, 1) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		cublasShutdown();

		return 1;

	}

	

	/* Résolution du système */

	cublasDtrsv('L', 'N', 'N', NC, dev_AtA, NC, dev_Atb, 1);

	if(cublasGetError() != CUBLAS_STATUS_SUCCESS) {

		getMessage(cublasGetError());

		cublasShutdown();

		return 1;

	}

	

	memset(Atb, 0, NC*sizeof(double));

	

	/* Récupération de la matrice */

	if(cublasGetVector(NC, sizeof(double), dev_Atb, 1, Atb,1) != CUBLAS_STATUS_SUCCESS) {

		printf("fail!\n");

		getMessage(cublasGetError());

		cublasShutdown();

		return 1;

	}

	

	/* Enregistrement du vecteur X dans le fichier résultat */

	for(i = 0; i < NC; ++i) {

		printf("%f\n", *(Atb+i));

	}

	close(res);

	

	/* Désallocation de la mémoire GPU */

	cublasFree(dev_AtA);

	cublasFree(dev_Atb);

	

	/* Libération des ressources GPU de cublas */

	cublasShutdown();

	

	/* Désallocation de la mémoire CPU */

	free(AtA);

	free(Atb);

	

	return 0;

}

Before Dtrsv

AtA = 5  x 5 

2.000000 0.000000 0.000000 0.000000 1.000000 

0.000000 1.000000 0.000000 0.000000 0.000000 

0.000000 0.000000 1.000000 0.000000 0.000000 

0.000000 0.000000 0.000000 1.000000 0.000000 

1.000000 0.000000 0.000000 0.000000 1.000000 

Atb = 5 

Atb[0] = 6.000000

Atb[1] = 2.000000

Atb[2] = 3.000000

Atb[3] = 4.000000

Atb[4] = 5.000000

after Dtrsv

Atb[0] = 3.000000

Atb[1] = 2.000000

Atb[2] = 3.000000

Atb[3] = 4.000000

Atb[4] = 2.000000

That is Dtrsv is triangular solver, if AtA = L + D + U,

then Dtrsv solves (L + D)* x = Atb

I think this is not what you want.

Yes i want a triangular solver because my matrix AtA is symetrical.
AtA = L + U with L = transpose(U).

what is your app?

Ax = b, then apply normal equation A’ * A * x = A’*b

if so, why not using QR, then solve R*x = Q’*b by Dtrsv

That’s what i’m doing but i have wrong results … and i’m use column-major storage.

Dtrsv cannot solve your problem A’Ax = A’*b directly

I think you need QR routine of MAGMA or CULA to solve A = QR, then apply Dtrsv to solve R*x = Q’*b.

Ok thanks for your answers but could you explain why dtrsv can’t work with my matrix ?

Thanks.

You need to invert matrix A’*A and CUBLAS does not support inversion of a full matrix.

If your matrix is either lower triangle or upper triangle, then you can use Dtrsv.

For example, if you want to solve Ax=b, then you can do LU decomposition first, say A = LU.

Traditionally, we store lower triangle matrix L and upper triangle matrix U into A.

step 1: solve L*z = b by dtrsv(‘L’, …) because L is lower triangle of A.

step 2: solve U*x = z by dtrsv(‘U’, …) because U is upper triangle of A.

Why is it necessary to invert my matrix A’*A ?

If we take E = A’*A and F = A’b, why CUBLAS can’t resolve Ex=F ??

I can’t resolve my system another way because A is a 10.000x10.000.000 matrix that can’t be loaded in the memory and has to be loaded bloc by bloc like i do.

Thanks for your help.

A of size 10000x10000 is 800MB. C2050 can afford it.

You need to solve normal equation first, then you can consider big matrix.

just try small matrix, 2x2 or 5x5, remember CUBLAS is not LAPACK!

It’s not 10.000 x 10.000 but 10.000 x 10.000.000 ! It makes 745Go …

I tried to solve with 2x2 and 5x5 matrix but apparently CUBLAS is not working as i expected.

I remind you that i’m trying to solve the equation A’Ax = A’*b but dtrsv() is not returning the good results.
Is use A’*A instead of A because A is too big whereas A’*A is a 10.000 x 10.000 (if A 10.000 columns) matrix and can be loaded in the memory.

I’m not trying to make an app, i’m working for somebody who wants me to do like this.

Thanks

If you have matlab, then try a simple example

A = [1 1 ; 2 1]

b = [2 ; 3]

A is invertible, A*x = b has unique solution x = A\b = [1 ; 1]

AtA = A'*A ;

Atb = A'*b ;

AtA =

5     3

     3     2

Atb =

8

     5

You can solve A’A*x = A’*b, then x = AtA \ Atb = [1 ; 1]

However Dtrsv solve another problem B *x = Atb where

B = [5 0 ; 3 2]

then x = [1.6 ; 0.1]

So you can not solve normal equation by Dtrsv directly.

if A is 10.000 x 10.000.000, then I would suggest QR, say A*P = [A1 A2]

P is a permutation matrix, and A1 is a 10000 x 10000 matrix and rank(A) = rank(A1), then A1 = Q * R

you need to solve (AP’)(Px) = b.

set z = Px = [z1 ; z2] , z1 has dimesion 10000

(AP’)(Px) = b implies QRz1 + A2*z2 = b, set z2 = 0, then z1 = R \ (Q’b).

This is better than A’Ax = A’*b because A’*A is 10.000.000 x 10.000.000.

I think that it’s a good idea.

Thanks for all your help !