Matrix Multiplication with CUBLAS and MATLAB

David_Lisin · July 1, 2009, 3:25pm

Good afternoon, from what ive been working on (Matlab and CUDA), here is a possible solution for anyone interested in Generic Matrix Multiplication. I use the cublas library, but the following code is to demonstrate how its possible to call cublas directly from cuda. I hope it serves all who are interested. Good luck to everyone in CUDA, David Lisin.

This is the matrixMult.m file:

[codebox]

dimension1=4000;

dimension2=4000;

A = diag(rand(dimension1,dimension1))*ones(1,dimension2);

B=ones(dimension2,dimension1);

disp(‘Matlab:’)

tic

C1=A*B;

toc

disp(‘Cuda:’)

tic

C2=matrixMulCUDA(A,B);

toc

[/codebox]

This is the actual cuda file (matrixMulCUDA.cu):

[codebox]

#pragma comment (lib,“C:\CUDA\lib\cublas.lib”)

#include “cuda.h”

#include “mex.h”

#include <string.h>

#include “cublas.h”

/* Gateway function */

void mexFunction(int nlhs, mxArray *plhs,int nrhs, const mxArray *prhs){

int mMatrizA;

int nMatrizA;

int mMatrizB;

int nMatrizB;

int mSalidaFuncion;

int nSalidaFuncion;

const mxArray *paramMatrizA;

double *matrizA;

const mxArray *paramMatrizB;

double *matrizB;

double *matrizAf, *matrizA_gpu;

double *matrizBf, *matrizB_gpu;

double *salidaFuncion;

double *salidaFuncionf, *salidaFuncion_gpu;	

int j;

/* ############################################################

######## */

/* Get the dimensions of the matrizA */

paramMatrizA=prhs[0];

/* We collect the number of params */

mMatrizA= mxGetM(paramMatrizA);

nMatrizA=mxGetN(paramMatrizA);

/* ############################################################

######## */

/* Get the dimensions of the matrizB */

paramMatrizB=prhs[1];

/* We collect the number of params */

mMatrizB= mxGetM(paramMatrizB);

nMatrizB= mxGetN(paramMatrizB);

/* ############################################################

######## */

/*###########################################################

############*/

mSalidaFuncion=mMatrizA;

nSalidaFuncion=nMatrizB;

/*###########################################################

############*/

/* ############################################################

######## */

/* Create an mxArray for the output data */

if(!(plhs[0]=mxCreateDoubleMatrix(mSalidaFuncion,nSalidaFunc

ion,mxREAL))){

    printf("\nERROR\n");

    return;

}

/* ############################################################

######## */

/* Creamos los arrays de input y arrays de salida del GPU */

cudaMalloc((void**)&matrizA_gpu,sizeof(double)*mMatrizA*nMatrizA);

cudaMalloc((void**)&matrizB_gpu,sizeof(double)*mMatrizB*nMatrizB);

cudaMalloc((void**)&salidaFuncion_gpu,sizeof(double)*mSalidaFuncion*nSalidaF

uncion);

/* ############################################################

######## */

/* Recogemos los datos del input */

matrizA=mxGetPr(prhs[0]);

matrizB=mxGetPr(prhs[1]);

/* ############################################################

######## */

matrizAf=(double*)mxMalloc(sizeof(double)*mMatrizA*nMatrizA)

;

for(j=0;j<nMatrizA*mMatrizA;j++){

    matrizAf[j]=matrizA[j];

}    

cudaMemcpy(matrizA_gpu,matrizAf,sizeof(double)*mMatrizA*nMat

rizA,cudaMemcpyHostToDevice);

/* ############################################################

######## */

matrizBf=(double*)mxMalloc(sizeof(double)*mMatrizB*nMatrizB)

;

for(j=0;j<nMatrizB*mMatrizB;j++){

    matrizBf[j]=matrizB[j];

}    

cudaMemcpy(matrizB_gpu,matrizBf,sizeof(double)*mMatrizB*nMat

rizB,cudaMemcpyHostToDevice);

/* ############################################################

######## */

/* ############################################################

####### */

/* Reservamos la memoria de la salida de doble precision */

salidaFuncionf=(double*)mxMalloc(sizeof(double)*mSalidaFunci

on*nSalidaFuncion);

/* ############################################################

*/

/* READY TO CALL SGEMM */

//(void) cublasDgemm ('n','n',mMatrizA,nMatrizB,nMatrizA,1,matrizA,mMatrizA,matrizB_g

pu,mMatrizB,0,salidaFuncion_gpu,mMatrizA);

/* READY TO CALL SGEMM */

(void) cublasDgemm ('n','n',mMatrizA,nMatrizB,nMatrizA,1,matrizA_gpu,mMatrizA,matri

zB_gpu,mMatrizB,0,salidaFuncion_gpu,mMatrizA);

// block until the device has completed

cudaThreadSynchronize();

/* Copy result back to host */

cudaMemcpy( salidaFuncionf, salidaFuncion_gpu, sizeof(double)*mSalidaFuncion*nSalidaFuncion, cudaMemcpyDeviceToHost);

/* Create a pointer to the output data */

salidaFuncion = mxGetPr(plhs[0]);

for(j=0;j<nSalidaFuncion*mSalidaFuncion;j++){

        salidaFuncion[j]=(double)salidaFuncionf[j];       

}

/* Clean-up memory on device and host */

mxFree(matrizAf);

    mxFree(matrizBf);

    mxFree(salidaFuncionf);

cudaFree(matrizA_gpu);	

cudaFree(matrizB_gpu);	

cudaFree(salidaFuncion_gpu);

}

[/codebox]

The output is:

Matlab:

Elapsed time is 21.544856 seconds.

Cuda:

Elapsed time is 2.545485 seconds.

So not too bad :)

Im using double precision matrix multiplication DGEMM, but if you dont need double precision, use single precision SGEMM, as it has better times.

Again, good luck to everyone working with CUDA :)

jam11 · July 6, 2009, 10:43am

And how would you make these two pieces of code talk to each other.
I am using Octave.
As far as I understand, I have to compile the .cu into a mex file before I can call the function matrixMulCUDA, but mkoctfile calls only the gcc compiler and .cu is not a valid extension.
How to I get nvcc into the mix?
Thank you.

mfatica · July 6, 2009, 11:15am

I use the following Makefile.

You first create an object file from the .cu using nvcc and then use mkoctfile to create the mex.

# Define installation location for CUDA and compilation flags compatible

# with the CUDA include files.

CUDAHOME	= /usr/local/cuda

INCLUDEDIR  = -I$(CUDAHOME)/include

INCLUDELIB  = -L$(CUDAHOME)/lib -lcufft -Wl,-rpath,$(CUDAHOME)/lib

CFLAGS	  = -fPIC -D_GNU_SOURCE -pthread -fexceptions

COPTIMFLAGS = -O3 -funroll-loops -msse2

# Define installation location for Octave

MKOCTFILE = /Applications/Octave.app/Contents/Resources/bin/mkoctfile

MEXEXT		= .mex

# List the mex files to be built.  

MEXFILES = fft2_cuda.mex	   \

		   fft2_cuda_sp_dp.mex \

		   ifft2_cuda.mex	  \

		   Szeta.mex

all: $(MEXFILES)

clean:

	rm -f $(MEXFILES) *.linkinfo 

.SUFFIXES: .cu .o .mex

.cu.o: 

	nvcc -c COPTIMFLAGS='$(COPTIMFLAGS)'  $< $(INCLUDEDIR) $(INCLUDELIB)

.mex: 

	$(MKOCTFILE) CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' $@ \

		$(INCLUDEDIR) $(INCLUDELIB)

jam11 · July 6, 2009, 4:27pm

No success with your makefile.
If I understand this, the .cu file is compile into a .cu.o file and this file is then used as an argument to mkoctfile.

I still have the same problem: mkocfile does not accept and object file .o as an argument.

Are your source files only .cu ?

I have Octave-3.0.0

mfatica · July 6, 2009, 9:12pm

Which OS?
I tested my Makefile on both MacosX and Linux and it worked just fine.
Are you sure you are using the mkoctfile from version 3 and not from version 2?
Check with mkoctfile -v

jam11 · July 7, 2009, 1:35am

The mkoctfile version is 3.0.0

But the make program is:

GNU Make 3.81
This program built for x86_64-slackware-linux-gnu

This seems to be the problem. This linux distro is not the final release. There could still be some bugs. But this is the first makefile which give me trouble.

The mkoctfile does accept the .o files after all, I was looking at the wrong place for the error.
I am doing the compilation without make but still have to chase some segfault violation.

mfatica · July 7, 2009, 1:55am

Check that there are tabs not spaces in the Makefile and add -lcublas if you want to use the mex file in this thread.
I uploaded the file, it should preserve the tabs better than the cut and paste.
Makefile_Octave.txt (896 Bytes)

Boxed_Cylon · July 8, 2009, 8:46am

A few of us worked on a similar example of matrix multiplication in matlab some time ago. I wrote up the issues we worked on and gave similar example code (and other examples) in a small tutorial, available here:

[url=“http://forums.nvidia.com/index.php?showtopic=70731”]http://forums.nvidia.com/index.php?showtopic=70731[/url]

If one pads the arrays with zeros to be dimensioned to multiples of 16, 32, etc., cublas is meant to be more efficient.

David_Lisin · July 17, 2009, 9:57am

Good Morning,

adding does effectively improve multiplication times, but obviously involves additional time when padding and unpadding the matrix to return to matlab, thus I did not include padding with the example. My intention was to give a generic example of matrix multiplication, without any aditional elements to take into account. But as you have very well said, matrix dimensions multiple of 16, 32 and 64 give much better multiplication times.

Yours sincerely,

David Lisin

huangch · August 14, 2009, 5:47am

Hi.

Thanks for the contribution.

However the results of the code seems not correct. I think that might be because of my configuration. I’m using cuda-2.2 on Linux, Win32, OSX.

Any idea?

There is the result:

Matlab:

C1 =

0.9165    0.1359
0.9290    0.1377

Elapsed time is 0.001535 seconds.
Cuda:

C2 =

 0     0
 0     0

Elapsed time is 0.225618 seconds.

== begin of code ==
dimension1=2;
dimension2=1;

A = diag(rand(dimension1,dimension1))*ones(1,dimension2);
% B=ones(dimension2,dimension1);
B = diag(rand(dimension1,dimension1))ones(1,dimension2);
B = B’;
disp(‘Matlab:’)
tic
C1=AB;
C1
toc

disp(‘Cuda:’)
tic
C2=matrixMulCUDA(A,B);
C2
toc
=== end of code ===

Topic		Replies	Views
Matrix Multiplication with CUBLAS and MATLAB CUDA Programming and Performance	0	1003	July 1, 2009
Non-square Matrix Multiplication, not getting any cValues back CUDA Programming and Performance	2	3389	July 1, 2009
Help with CUDA+Matlab acceleration CUDA Programming and Performance	24	14488	December 17, 2008
Matrix Multiplication In CUDA CUDA Programming and Performance	6	2525	May 11, 2015
Matlab nvmex compiling and problems CUDA Programming and Performance	3	10646	July 8, 2009
Matlab & CUDA Cuda scripts executed from Matlab CUDA Programming and Performance	14	10263	September 9, 2015
Matlab 2007b and CUDA Can't compile CUDA example with Matlab CUDA Programming and Performance	19	20099	June 25, 2009
Problems of matrix multiplication With and without CUDA CUDA Programming and Performance	15	9993	January 18, 2012
Matrix Multiplication Garbage value :( CUDA Programming and Performance	10	3399	July 25, 2009
Compiling error in Ubuntu 14.04, CUDA Toolkit v7.5 CUDA Programming and Performance	5	2363	November 13, 2015

Matrix Multiplication with CUBLAS and MATLAB

*/

Related topics