simple matrix (or matrix vector) multiplication using CUBLAS

fender177 · November 5, 2009, 11:31pm

Hello,

I’m new to CUDA and I am writing a matrix multiplier. It works as expected. However, I came across CUBLAS and thought it would be much wiser to use it instead. I’m not a math whiz, so reading the docs doesn’t make a lot of sense to me. Is there a function to multiply 2 matrices together, or a matrix and a vector? I don’t need to do any transformations or anything fancy - just simple matrix multiplication.

Thanks,
-Andrew

LSChien · November 6, 2009, 1:49am

Suppose you have built matrices A, B and C of size hA x wA, hB x wB and hA x wB (hA is the number of rows of A and wA is the number of columns of A, and hB must equal wA). Then the following code computes C = A * B via CUBLAS:

float *devPtrA, *devPtrB, *devPtrC;

cublasInit(); // initialize the CUBLAS library

// allocate device memory
cublasAlloc(hA*wA, sizeof(float), (void**) &devPtrA);
cublasAlloc(wA*wB, sizeof(float), (void**) &devPtrB);
cublasAlloc(hA*wB, sizeof(float), (void**) &devPtrC);

// transfer host data to device
cublasSetMatrix(hA, wA, sizeof(float), A, hA, devPtrA, hA);
cublasSetMatrix(wA, wB, sizeof(float), B, wA, devPtrB, wA);

// compute C = A*B on the device
float alpha = 1.0f;
float beta  = 0.0f;
cublasDgemm('N', 'N', hA, wB, wA, alpha, devPtrA, hA,
            devPtrB, wA, beta, devPtrC, hA);

// copy the result back to the host
cublasGetMatrix(hA, wB, sizeof(float), devPtrC, hA, C, hA);

cublasShutdown();
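
One detail worth making explicit about the snippet above: cublasSetMatrix with a leading dimension of hA assumes the host array A is stored column by column (the Fortran convention), so element (i, j) lives at A[i + j*hA]. If the host data is in the usual C row-major layout, it has to be repacked first. A minimal plain-C sketch of that repacking follows; the helper names `cm_index` and `row_to_col_major` are made up for illustration and are not part of CUBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (row i, column j) in a column-major matrix whose
   leading dimension (distance between consecutive columns) is ld. */
static size_t cm_index(int i, int j, int ld)
{
    assert(i >= 0 && j >= 0 && i < ld);
    return (size_t)i + (size_t)j * (size_t)ld;
}

/* Repack a rows x cols matrix from C row-major order (src) into
   column-major order (dst). Both buffers hold rows * cols floats. */
static void row_to_col_major(const float *src, float *dst, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            dst[cm_index(i, j, rows)] = src[(size_t)i * cols + j];
}
```

For the 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6), the row-major array {1, 2, 3, 4, 5, 6} becomes the column-major array {1, 4, 2, 5, 3, 6}.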

zajo · November 6, 2009, 2:47am

I’m in the same boat, but I am not sure that cuBLAS is the way to go; BLAS/LAPACK don’t mix well with GPUs.

Besides, on GPUs memory bandwidth is king, which means that it often makes sense to trade accuracy/quality for performance, basically for the same reasons games use the lowest-resolution textures they can get away with. So, in my mind it makes sense to support different “pixel formats” for matrices and vectors, too.

Here are more thoughts on that subject; I’m looking for the answer as well: http://www.revergestudios.com/reblog/index.php?n=ReCode.LAPACK

–Emil

LSChien · November 6, 2009, 7:20am

Yes, communication between the CPU and GPU over PCI Express is a big problem. For example, for PDE problems the best choice is to put the whole problem into device memory and do the computation on the GPU. That way you can call CUBLAS routines thousands of times on data that is already on the device, which would be great.

Of course, most PDE solvers are memory-bound, so their limit is the bandwidth of device memory (100 GB/s on the Tesla C1060).
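
To put a number on the memory-bound point, here is a back-of-the-envelope estimate: an operation that must move a given number of bytes through device memory cannot finish faster than bytes divided by bandwidth. The sketch below uses the 100 GB/s figure quoted above (taken as 100e9 bytes/s). The function names and the SAXPY example are illustrative assumptions, and real kernels will be somewhat slower than this lower bound.

```c
#include <assert.h>

/* Lower bound, in milliseconds, on the time needed to stream `bytes`
   bytes through memory with the given bandwidth in GB/s (1e9 bytes/s). */
static double min_transfer_ms(double bytes, double bandwidth_gb_per_s)
{
    assert(bytes >= 0.0 && bandwidth_gb_per_s > 0.0);
    return bytes / (bandwidth_gb_per_s * 1e9) * 1e3;
}

/* Bytes moved by SAXPY (y = a*x + y) on n floats:
   read x, read y, write y. */
static double saxpy_bytes(double n)
{
    return 3.0 * n * sizeof(float);
}
```

For n = 10 million floats, SAXPY moves 120 MB, so at 100 GB/s it needs at least about 1.2 ms no matter how fast the arithmetic units are.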

avidday · November 6, 2009, 8:04am

If you want to see how wrong that line of thinking can be, then I recommend you have a look at this report, and the accompanying code found in this thread, which present fully functional CUDA implementations of three of the most common matrix factorization routines in LAPACK (SGETRF, SPOTRF and SGEQRF) which run at over 300 single precision Gflops/s on current GT200 hardware in conjunction with a fast host side CPU and optimized host BLAS.

fender177 · November 6, 2009, 8:49pm

Thanks much for this! It’s a big help. :)

mfatica · November 6, 2009, 9:21pm

You will need to change cublasDgemm to cublasSgemm to get the correct answer.
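
For readers unsure what the corrected call computes: the legacy cublasSgemm('N', 'N', m, n, k, alpha, A, lda, B, ldb, beta, C, ldc) evaluates C = alpha*A*B + beta*C in single precision, with all three matrices in column-major order. cublasDgemm is the double-precision variant, so it would misinterpret float buffers. The CPU function below is only a reference sketch of that arithmetic, useful for checking GPU results. The name `ref_sgemm_nn` is hypothetical, it covers only the no-transpose case, and it says nothing about how CUBLAS is implemented internally.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the no-transpose GEMM: C = alpha*A*B + beta*C.
   A is m x k, B is k x n, C is m x n, all column-major with leading
   dimensions lda, ldb, ldc (same argument order as the BLAS call). */
static void ref_sgemm_nn(int m, int n, int k, float alpha,
                         const float *A, int lda,
                         const float *B, int ldb,
                         float beta, float *C, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            size_t c = (size_t)i + (size_t)j * ldc;
            /* BLAS convention: when beta is 0, C is not read at all. */
            C[c] = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * C[c];
        }
    }
}
```

With the names used in this thread, the equivalent call is ref_sgemm_nn(hA, wB, wA, 1.0f, A, hA, B, wA, 0.0f, C, hA). A matrix-vector product is the special case wB = 1; CUBLAS also provides cublasSgemv for that.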

sirotenko · November 25, 2009, 9:20am

I’m writing a CUDA program where matrix multiplication is only one part of a bigger algorithm. Can I just use device memory that is already allocated and holds the results of a previous kernel execution, i.e. skip cublasAlloc and cublasSetMatrix?

avidday · November 25, 2009, 9:37am

You can, but you need to remember that CUBLAS is modelled after Fortran BLAS implementations and expects column-major storage. Some of the algorithms in CUBLAS perform a lot better on storage aligned to 16 words, so there can be benefits in padding your matrices and vectors. Apart from that, cublasAlloc and cublasSetMatrix are effectively just wrappers for cudaMalloc and cudaMemcpy, and work the same way.
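
The padding idea can be sketched in plain C: round the leading dimension up to a multiple of 16 floats so that every column starts on a 16-word boundary (assuming the base pointer itself is suitably aligned), then pass that padded value as the leading dimension in the BLAS call. The helper names below are hypothetical. Whether padding pays off depends on the hardware and CUBLAS version, so it is worth measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round a row count up to the next multiple of `align` elements. */
static int padded_ld(int rows, int align)
{
    assert(rows >= 0 && align > 0);
    return (rows + align - 1) / align * align;
}

/* Copy a rows x cols column-major matrix with leading dimension rows
   into padded storage with leading dimension ld, zero-filling the pad.
   dst must hold ld * cols floats. */
static void pad_columns(const float *src, float *dst,
                        int rows, int cols, int ld)
{
    assert(ld >= rows);
    memset(dst, 0, (size_t)ld * cols * sizeof(float));
    for (int j = 0; j < cols; ++j)
        memcpy(dst + (size_t)j * ld, src + (size_t)j * rows,
               (size_t)rows * sizeof(float));
}
```

For example, a matrix with 100 rows gets a leading dimension of padded_ld(100, 16) = 112. When copying to the device, the same layout can be produced by passing the padded value as the destination leading dimension (the last argument) of cublasSetMatrix.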

sirotenko · November 25, 2009, 2:36pm

Thanks, I’ve tried it and it works normally.
