Matrix Multiplication with CUBLAS and MATLAB

Good afternoon. From what I've been working on (MATLAB and CUDA), here is a possible solution for anyone interested in generic matrix multiplication. I use the CUBLAS library, and the following code demonstrates how it is possible to call CUBLAS directly from a CUDA MEX file. I hope it serves all who are interested. Good luck to everyone in CUDA, David Lisin.

This is the matrixMult.m file:

[codebox]
dimension1=4000;
dimension2=4000;

A = diag(rand(dimension1,dimension1))*ones(1,dimension2);
B = ones(dimension2,dimension1);

disp('Matlab:')
tic
C1=A*B;
toc

disp('Cuda:')
tic
C2=matrixMulCUDA(A,B);
toc
[/codebox]

This is the actual cuda file (matrixMulCUDA.cu):

[codebox]
#pragma comment (lib,"C:\\CUDA\\lib\\cublas.lib")

#include "cuda.h"
#include "mex.h"
#include <string.h>
#include "cublas.h"

/* Gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]){

    int mMatrizA, nMatrizA;
    int mMatrizB, nMatrizB;
    int mSalidaFuncion, nSalidaFuncion;
    const mxArray *paramMatrizA;
    const mxArray *paramMatrizB;
    double *matrizA, *matrizB;
    double *matrizAf, *matrizA_gpu;
    double *matrizBf, *matrizB_gpu;
    double *salidaFuncion;
    double *salidaFuncionf, *salidaFuncion_gpu;
    int j;

    /* Get the dimensions of matrizA */
    paramMatrizA=prhs[0];
    mMatrizA=mxGetM(paramMatrizA);
    nMatrizA=mxGetN(paramMatrizA);

    /* Get the dimensions of matrizB */
    paramMatrizB=prhs[1];
    mMatrizB=mxGetM(paramMatrizB);
    nMatrizB=mxGetN(paramMatrizB);

    /* Dimensions of the output: (m x k)*(k x n) = (m x n) */
    mSalidaFuncion=mMatrizA;
    nSalidaFuncion=nMatrizB;

    /* Create an mxArray for the output data */
    if(!(plhs[0]=mxCreateDoubleMatrix(mSalidaFuncion,nSalidaFuncion,mxREAL))){
        printf("\nERROR\n");
        return;
    }

    /* Initialize the legacy CUBLAS library */
    cublasInit();

    /* Create the GPU input and output arrays */
    cudaMalloc((void**)&matrizA_gpu,sizeof(double)*mMatrizA*nMatrizA);
    cudaMalloc((void**)&matrizB_gpu,sizeof(double)*mMatrizB*nMatrizB);
    cudaMalloc((void**)&salidaFuncion_gpu,sizeof(double)*mSalidaFuncion*nSalidaFuncion);

    /* Collect the input data */
    matrizA=mxGetPr(prhs[0]);
    matrizB=mxGetPr(prhs[1]);

    /* Copy input A to the device */
    matrizAf=(double*)mxMalloc(sizeof(double)*mMatrizA*nMatrizA);
    for(j=0;j<nMatrizA*mMatrizA;j++){
        matrizAf[j]=matrizA[j];
    }
    cudaMemcpy(matrizA_gpu,matrizAf,sizeof(double)*mMatrizA*nMatrizA,cudaMemcpyHostToDevice);

    /* Copy input B to the device */
    matrizBf=(double*)mxMalloc(sizeof(double)*mMatrizB*nMatrizB);
    for(j=0;j<nMatrizB*mMatrizB;j++){
        matrizBf[j]=matrizB[j];
    }
    cudaMemcpy(matrizB_gpu,matrizBf,sizeof(double)*mMatrizB*nMatrizB,cudaMemcpyHostToDevice);

    /* Allocate host memory for the double-precision output */
    salidaFuncionf=(double*)mxMalloc(sizeof(double)*mSalidaFuncion*nSalidaFuncion);

    /* Call DGEMM: salidaFuncion_gpu = 1.0*matrizA_gpu*matrizB_gpu */
    (void) cublasDgemm('n','n',mMatrizA,nMatrizB,nMatrizA,1.0,matrizA_gpu,mMatrizA,matrizB_gpu,mMatrizB,0.0,salidaFuncion_gpu,mMatrizA);

    /* Block until the device has completed */
    cudaThreadSynchronize();

    /* Copy result back to host */
    cudaMemcpy(salidaFuncionf,salidaFuncion_gpu,sizeof(double)*mSalidaFuncion*nSalidaFuncion,cudaMemcpyDeviceToHost);

    /* Copy the result into the output mxArray */
    salidaFuncion = mxGetPr(plhs[0]);
    for(j=0;j<nSalidaFuncion*mSalidaFuncion;j++){
        salidaFuncion[j]=salidaFuncionf[j];
    }

    /* Clean up memory on device and host */
    mxFree(matrizAf);
    mxFree(matrizBf);
    mxFree(salidaFuncionf);
    cudaFree(matrizA_gpu);
    cudaFree(matrizB_gpu);
    cudaFree(salidaFuncion_gpu);
    cublasShutdown();
}
[/codebox]

The output is:

Matlab:

Elapsed time is 21.544856 seconds.

Cuda:

Elapsed time is 2.545485 seconds.

So not too bad :)

I'm using double-precision matrix multiplication (DGEMM), but if you don't need double precision, use single-precision SGEMM instead, as it gives better times.

Again, good luck to everyone working with CUDA :)

And how would you make these two pieces of code talk to each other?
I am using Octave.
As far as I understand, I have to compile the .cu into a MEX file before I can call the function matrixMulCUDA, but mkoctfile only calls the gcc compiler, and .cu is not a valid extension.
How do I get nvcc into the mix?
Thank you.

I use the following Makefile.

You first create an object file from the .cu using nvcc and then use mkoctfile to create the mex.

# Define installation location for CUDA and compilation flags compatible
# with the CUDA include files.
CUDAHOME    = /usr/local/cuda
INCLUDEDIR  = -I$(CUDAHOME)/include
INCLUDELIB  = -L$(CUDAHOME)/lib -lcufft -Wl,-rpath,$(CUDAHOME)/lib
CFLAGS      = -fPIC -D_GNU_SOURCE -pthread -fexceptions
COPTIMFLAGS = -O3 -funroll-loops -msse2

# Define installation location for Octave
MKOCTFILE = /Applications/Octave.app/Contents/Resources/bin/mkoctfile
MEXEXT    = .mex

# List the mex files to be built.
MEXFILES = fft2_cuda.mex       \
           fft2_cuda_sp_dp.mex \
           ifft2_cuda.mex      \
           Szeta.mex

all: $(MEXFILES)

clean:
	rm -f $(MEXFILES) *.linkinfo

.SUFFIXES: .cu .o .mex

.cu.o:
	nvcc -c $< -Xcompiler '$(CFLAGS) $(COPTIMFLAGS)' $(INCLUDEDIR)

.o.mex:
	$(MKOCTFILE) --mex CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' $< \
		$(INCLUDEDIR) $(INCLUDELIB)

No success with your Makefile.
If I understand this, the .cu file is compiled into a .o file, and this file is then used as an argument to mkoctfile.

I still have the same problem: mkoctfile does not accept an object file (.o) as an argument.

Are your source files only .cu?

I have Octave-3.0.0

Which OS?
I tested my Makefile on both Mac OS X and Linux and it worked just fine.
Are you sure you are using the mkoctfile from version 3 and not from version 2?
Check with mkoctfile -v

The mkoctfile version is 3.0.0

But the make program is:

GNU Make 3.81
This program built for x86_64-slackware-linux-gnu

This seems to be the problem. This Linux distro is not the final release; there could still be some bugs. But this is the first Makefile that has given me trouble.

mkoctfile does accept the .o files after all; I was looking in the wrong place for the error.
I am doing the compilation without make, but I still have to chase down a segmentation fault.

Check that there are tabs, not spaces, in the Makefile, and add -lcublas if you want to use the mex file in this thread.
I uploaded the file; it should preserve the tabs better than cut and paste.
Makefile_Octave.txt (896 Bytes)

A few of us worked on a similar example of matrix multiplication in matlab some time ago. I wrote up the issues we worked on and gave similar example code (and other examples) in a small tutorial, available here:

[url="http://forums.nvidia.com/index.php?showtopic=70731"]http://forums.nvidia.com/index.php?showtopic=70731[/url]

If one pads the arrays with zeros so their dimensions are multiples of 16, 32, etc., CUBLAS is meant to be more efficient.

Good Morning,

Padding does effectively improve multiplication times, but it obviously involves additional time for padding the matrices and unpadding the result before returning to MATLAB, so I did not include padding in the example. My intention was to give a generic example of matrix multiplication, without any additional elements to take into account. But as you have rightly said, matrix dimensions that are multiples of 16, 32, and 64 give much better multiplication times.

Yours sincerely,

David Lisin

Hi.

Thanks for the contribution.

However, the results of the code seem incorrect. I think that might be because of my configuration. I'm using CUDA 2.2 on Linux, Win32, and OS X.

Any idea?

There is the result:

Matlab:

C1 =

0.9165    0.1359
0.9290    0.1377

Elapsed time is 0.001535 seconds.
Cuda:

C2 =

 0     0
 0     0

Elapsed time is 0.225618 seconds.

== begin of code ==
dimension1=2;
dimension2=1;

A = diag(rand(dimension1,dimension1))*ones(1,dimension2);
% B=ones(dimension2,dimension1);
B = diag(rand(dimension1,dimension1))*ones(1,dimension2);
B = B';
disp('Matlab:')
tic
C1=A*B;
C1
toc

disp('Cuda:')
tic
C2=matrixMulCUDA(A,B);
C2
toc
=== end of code ===