CUBLAS problem

Hi all,

I’ve started working with CUBLAS recently and i checked the CUDASDK example, tried it and got “PASSED”. Soon after, i made one alteration: instead of filling the matrices with random elements, i filled them with their indexes. However the is something very wrong and very strange.

Matrix A:

0.000000 4.000000 8.000000 12.000000 

0.000000 5.000000 9.000000 13.000000 

2.000000 6.000000 10.000000 14.000000 

3.000000 7.000000 11.000000 15.000000 

Matrix B:

0.000000 4.000000 8.000000 12.000000 

0.000000 5.000000 9.000000 13.000000 

2.000000 6.000000 10.000000 14.000000 

3.000000 7.000000 11.000000 15.000000 

Matrix C:

0.000000 152.000000 248.000000 344.000000 

0.000000 174.000000 286.000000 398.000000 

68.000000 196.000000 324.000000 452.000000 

74.000000 218.000000 362.000000 506.000000

How can this be? The element 1 is always 0, nevertheless the resulting C matrix is printed as if the other two had the element 1=1, except it prints its element = 0: In other words, if you check the multiplication of the printed matrices A and B it doesnt result in the printed matrix C. This does not make sense and i am certain it has something to do with the code that fills the matrices. Here are the alterations i made to the CUDASDK example:

Alteration 1:

#define index(i,j,ld) (((j)*(ld))+(i))

Alteration 2:

    float* h_A,* h_B,* h_C,*h_C_ref;

    int i,j;

    float* d_A;

    float* d_B;

    float* d_C;

Alteration 3:

        for (i=0;i<HA;i++)

	    for (j=0;j<WA;j++)   

	     h_A[index(i,j,HA)] = (float) index(i,j,HA);

        for (i=0;i<HB;i++)

           for (j=0;j<WB;j++)

	     h_B[index(i,j,HB)] = (float) index(i,j,HB);

My print function

void printMatrix(float*C,int uWC,int uHC){

	int i,j;

	for(i=0;i<uHC;i++){

		printf("\n");

		for(j=0;j<uWC;j++)

			printf("%f ",C[index(i,j,HC)]);

	}

}

Any thoughts? Thanks beforehand

P.S: Does anyone has a CUBLAS implementation of matrix multiplication that works for all type of matrices? The CUDASDK example only works for square matrices.

EDIT: If i print Matrix A or B right after filling them, they are printed right, so it must be one of the functions of the SDK example that is corrupting it

Do you use CUBLAS or matrixMul in SDK?

I assume that you use matrixMul in SDK.
Be careful: matrixMul in SDK does not check boundary

You can run cublasSgemm.

Nop, I am using the “simpleCublas” example.

P.S: sergioamendes and I are both working at the same institute(Biophysics and Biomedical Engineering Institute at the Univercity of Lisbon - Faculty of Sciences, I am working with breast tomosynthesis and he is working with positron emission mamography, anyway, we are both learning to use CUDA at the moment).

We have found the problem. I had one other doubt nonetheless, it concerns the Column Major representation

When i fill the matrices like this:

for (i=0;i<HA;i++)

            for (j=0;j<WA;j++)   

             h_A[index(i,j,HA)] = (float) index(i,j,HA);

        for (i=0;i<HB;i++)

           for (j=0;j<WB;j++)

             h_B[index(i,j,HB)] = (float) index(i,j,HB);

Is this right according to column major representation?

column or row major order is completely determined by the definition of index. What you are using is column major (ie row numbers varying fastest).

I have read some of your posts on the subject and you said something like this “you can use your normal matrix and use the transpose in the SGEMM call if you are certain of the matrix structure”. What do you mean by “certain of the matrix structure”?

I simply mean that you know that the transpose of a given column major matrix is the equivalent of the same matrix stored in row major order. That won’t be true in every case.

I have written some examples and couldn’t find one that did not meet that criteria. Could you give an an example with a little matrix?

Well in order not to open another topic i will post here a code i’ve made in order to multiply all kinds of matrices with CUBLAS. It’s almost the same as the simpleCUBLAS SDK example, with the exception of the “cublasGet/SetMatrix”. It’s working fine for square matrices,but i am getting wrong results for non square matrices(albeit very close to the right values). I do not get any error, and i have checked countless times the syntax of cublasSet/GetMatrix, cublasAlloc and cublasSgemm and it seems ok

#include <stdlib.h>

#include <stdio.h>

#include "cublas.h"

#define HA 6

#define WA 4

#define WB 4

#define HB WA 

#define WC WB   

#define HC HA  

#define index(i,j,ld) (((j)*(ld))+(i))

void printMat(float*P,int uWP,int uHP){

	//printf("\n %f",P[1]);

	int i,j;

	for(i=0;i<uHP;i++){

		printf("\n");

		for(j=0;j<uWP;j++)

			printf("%f ",P[index(i,j,HC)]);

			//printf("%f ",P[i*uWP+j]);

	}

}

int  main (int argc, char** argv) {

	    cublasStatus status;

            int i,j;

            cublasInit();

float *A = (float*)malloc(HA*WA*sizeof(float));

            float *B = (float*)malloc(HB*WB*sizeof(float));

            float *C = (float*)malloc(HC*WC*sizeof(float));

	    if (A == 0) {

            fprintf (stderr, "!!!! host memory allocation error (A)\n");

            return EXIT_FAILURE;

	    }

	    if (B == 0) {

            fprintf (stderr, "!!!! host memory allocation error (A)\n");

            return EXIT_FAILURE;

	    }

	    if (C == 0) {

            fprintf (stderr, "!!!! host memory allocation error (A)\n");

            return EXIT_FAILURE;

            }

	

            for (i=0;i<HA;i++)

		 for (j=0;j<WA;j++)

			A[index(i,j,HA)] = (float) index(i,j,HA);

			

			

            for (i=0;i<HB;i++)

		 for (j=0;j<WB;j++)

			B[index(i,j,HB)] = (float) index(i,j,HB);

					

            float* AA; float* BB; float* CC;

	    /*ALLOCATE ON THE DEVICE*/

	    status=cublasAlloc(HA*WA,sizeof(float),(void**)&AA);

            if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! device memory allocation error (A)\n");

            return EXIT_FAILURE;

    		}

status=cublasAlloc(HB*WB,sizeof(float),(void**)&BB);

            if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! device memory allocation error (A)\n");

            return EXIT_FAILURE;

            }

status=cublasAlloc(HC*WC,sizeof(float),(void**)&CC);

            if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! device memory allocation error (A)\n");

            return EXIT_FAILURE;

            }

	    /*SET MATRIX*/

            status=cublasSetMatrix(HA,WA,sizeof(float),A,HA,AA,HA);

    	    if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! device memory allocation error (A)\n");

            return EXIT_FAILURE;

    	    }

status=cublasSetMatrix(HB,WB,sizeof(float),B,HB,BB,HB);

    	    if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! device memory allocation error (A)\n");

            return EXIT_FAILURE;

    	    }

	    

	    /*KERNEL*/

            cublasSgemm('n','n',HA,WB,WA,1,AA,HA,BB,HB,0,CC,HC);

		

    	    status = cublasGetError();

    	    if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! kernel execution error.\n");

            return EXIT_FAILURE;

            }

            cublasGetMatrix(HC,WC,sizeof(float),CC,HC,C,HC);

    	    if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! device read error (A)\n");

            return EXIT_FAILURE;

    	    }

	    /* PERFORMANCE OUTPUT*/

	    printf("\nMatriz A:\n");

	    printMat(A,WA,HA);

	    printf("\nMatriz B:\n");

	    printMat(B,WB,HB);

	    printf("\nMatriz C:\n");

	    printMat(C,WC,HC);

free( A );  free( B );  free ( C );

            status = cublasFree(AA);

            if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! memory free error (A)\n");

            return EXIT_FAILURE;

    	    }

    	    status = cublasFree(BB);

    	    if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! memory free error (B)\n");

            return EXIT_FAILURE;

    	    }

    	    status = cublasFree(CC);

    	    if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! memory free error (C)\n");

	    return EXIT_FAILURE;

	    }

/* Shutdown */

            status = cublasShutdown();

     	    if (status != CUBLAS_STATUS_SUCCESS) {

            fprintf (stderr, "!!!! shutdown error (A)\n");

            return EXIT_FAILURE;

    	    }

	    if (argc > 1) {

            if (!strcmp(argv[1], "-noprompt") ||!strcmp(argv[1], "-qatest") ) 

            {

		return EXIT_SUCCESS;

            }

            } 

            else

            {

                printf("\nPress ENTER to exit...\n");

                getchar();

            }

return EXIT_SUCCESS;

}

Think about cases where one dimension of a matrix might be padded (or the matrix is really a block of a large matrix). In that case the naive row major ordered transform is not the same at the fortran ordered original - the padding words must be written into the row major version and the dimensions of the matrix altered as well to reflect the presence of the padding.

Oh i see, so there won’t be any problem if you are working with two original matrices, that didnt suffer any kind of modification, right?

As you quoted me earlier- there is not necessarily a problem in passing row ordered storage to a fortran blas using the transposed versions of the routines, given that

[list=1]

You understand the structure of the arrays containing the matrices

A transposed version of the blas routine exists

The transpose version o the BLAS routine you want to use performs acceptably.

Yes thank you again for the enlightment, you deserve a statue! Could you just see what is problem with the CUBLAS code for non square matrices(the one sergioamendes posted)?The University computer does not let me sign in again, so there is really no rush, but if we fix that code, we can then move on to implement part of the OSEM algorithm we are supposed to do.

Your matrix print routine contains an error. The actual CUBLAS code itself is probbaly ok.

Hum…ok, although it’s strange because matrices A and B are printed all right. I am not at the university ATM, so i can only confirm this next monday. Nevertheless it is as i say, the only matrix that is not right is the resulting one.

The routine is indexing using HC all the time…

Yes, you are absolutely right. I think before I left the university I had already replaced it with uHP, but I am not sure. Thank you avidday, monday I will test it again ;). Have a nice weekend

avidday, i only had the oportunity to fix the code today, however the problem does not lie within the printing routine.

Notes:

1-If i use row major and then use the parameters ‘T’ in Sgemm i get this error for non square matrices: "** On entry to SGEMM parameter number 10 had an illegal value

!!! kernel execution error."

2-It works for non square matrices(Varying Matrix A Width ONLY)

Here are some outputs

Matrix A:

0.000000 2.000000 

1.000000 3.000000 

Matrix B:

0.000000 2.000000 4.000000 

1.000000 3.000000 5.000000 

Matrix C:

2.000000 6.000000 10.000000 

3.000000 11.000000 19.000000
Matrix A:

0.000000 2.000000 4.000000 6.000000 

1.000000 3.000000 5.000000 7.000000 

Matrix B:

0.000000 4.000000 

1.000000 5.000000 

2.000000 6.000000 

3.000000 7.000000 

Matrix C:

28.000000 76.000000 

34.000000 98.000000

I have reviewed the code many times, and I cant find anything wrong.

Sorry, is there a question in there somewhere?

I have no particular question, i have just a fried brain from the program not working properly. I have read a lot about CUBLAS, I am using column major order and all the other functions are taken from the simpleCUBLAS SDK example(except cublasSet/GetMatrix), so I don’t know why I am having wrong results. You, yourself, said the program should be fine and pointed out the printing routine error, don’t you see any more source of error(as minimal as it can be)?