simple matrix (or matrix vector) multiplication using CUBLAS


I’m new to CUDA and I am writing a matrix multiplier. It works as expected. However, I came across CUBLAS and thought it would be much wiser to use it instead. I’m not a math whiz, so reading the docs doesn’t make a lot of sense to me. Is there a function to multiply 2 matrices together, or a matrix and a vector? I don’t need to do any transformations or anything fancy - just simple matrix multiplication.


Suppose you have built matrices A, B and C with sizes hA x wA, hB x wB and hA x wB

(the number of rows of A is hA and the number of columns of A is wA; for the product to exist, wA must equal hB). Then the

following code computes C = A * B via CUBLAS:

float *devPtrA, *devPtrB, *devPtrC;

cublasInit(); // initialize CUBLAS

cublasAlloc(hA*wA, sizeof(float), (void**) &devPtrA);
cublasAlloc(wA*wB, sizeof(float), (void**) &devPtrB);
cublasAlloc(hA*wB, sizeof(float), (void**) &devPtrC);

// transfer host data to device
cublasSetMatrix(hA, wA, sizeof(float), A, hA, devPtrA, hA);
cublasSetMatrix(wA, wB, sizeof(float), B, wA, devPtrB, wA);

// compute C = A*B on the device
float alpha = 1.0f;
float beta  = 0.0f;
cublasDgemm('N', 'N', hA, wB, wA, alpha, devPtrA, hA,
            devPtrB, wA, beta, devPtrC, hA);

// copy the result back to the host
cublasGetMatrix(hA, wB, sizeof(float), devPtrC, hA, C, hA);


I’m in the same boat, but I am not sure that cuBLAS is the way to go; BLAS/LAPACK don’t mix well with GPUs.

Besides, on GPUs memory bandwidth is king, which means that it often makes sense to trade accuracy/quality for performance, basically for the same reasons games use the lowest-resolution textures they can get away with. So, in my mind it makes sense to support different “pixel formats” for matrices and vectors, too.

Here are more thoughts on that subject; I’m looking for the answer as well:…n=ReCode.LAPACK.


Yes, communication between CPU and GPU via PCI Express is a big problem.

In the PDE regime, for example, the best choice is to put the problem into device memory and do the

computation on the GPU; that way you can call the CUBLAS routines thousands of times without transfers, which works well.

Of course, most PDE solvers are memory-bound: their limit is the bandwidth of device memory (about 100 GB/s on a Tesla C1060).

If you want to see how wrong that line of thinking can be, I recommend you have a look at this report and the accompanying code found in this thread. They present fully functional CUDA implementations of three of the most common matrix factorization routines in LAPACK (SGETRF, SPOTRF and SGEQRF), which run at over 300 single-precision Gflop/s on current GT200 hardware in conjunction with a fast host-side CPU and an optimized host BLAS.

Thanks much for this! It’s a big help. :)

You will need to change cublasDgemm to cublasSgemm to get the correct answer, since the matrices are allocated and transferred as single-precision floats.

I’m writing a CUDA program where matrix multiplication is only part of a bigger algorithm. Can I just use already-allocated device memory holding the results of a previous kernel execution, i.e. skip cublasAlloc and cublasSetMatrix?

You can, but you need to remember that CUBLAS is modelled after Fortran BLAS implementations and expects column major order storage. Some of the algorithms in CUBLAS perform a lot better on 16 word aligned storage, so there can be benefits in padding your matrices and vectors, but cublasAlloc and cublasSetMatrix are effectively just wrappers for cudaMalloc and cudaMemcpy, and work the same way.

Thanks, I’ve tried it and it works normally.