Matrix multiplication performance

zoharl · August 3, 2013, 8:34pm

Hi,

I would like to confirm that the performance of matrix multiplication using cuBLAS is dominated by host <-> device memory transfer. More specifically:

A is 172974x241
B is 241x3
(C is 172974x3)

VS profiler results:

A (in my case) is already on the device
cublasAlloc + cublasSetMatrix of B 0.6%
cublasAlloc of C <0.1%
cublasSgemm 0.1%
cublasGetMatrix of C 1.2%
cublasFree of B 0.1%
cublasFree of C 0.3%
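
For scale, here is a rough back-of-envelope sketch of the data volumes and arithmetic behind those numbers. It assumes single precision (4-byte elements, as the cublasSgemm entry suggests). The helper names matrixBytes, gemmFlops and printProblemSummary are made up for illustration and are not part of cuBLAS.

```cpp
#include <cstdint>
#include <cstdio>

// Bytes needed to store a rows x cols matrix of elemSize-byte elements.
constexpr std::uint64_t matrixBytes(std::uint64_t rows, std::uint64_t cols, std::uint64_t elemSize)
{
	return rows * cols * elemSize;
}

// Floating-point operations for an (m x k) * (k x n) product:
// one multiply and one add per inner-product term.
constexpr std::uint64_t gemmFlops(std::uint64_t m, std::uint64_t n, std::uint64_t k)
{
	return 2 * m * n * k;
}

// Print the sizes for the matrices quoted in this post.
inline void printProblemSummary()
{
	const std::uint64_t rows = 172974, inner = 241, cols = 3;
	const std::uint64_t elem = 4; // assumption: sizeof(float)
	std::printf("A (already on device): %llu bytes\n",
		(unsigned long long)matrixBytes(rows, inner, elem));
	std::printf("B (host -> device):    %llu bytes\n",
		(unsigned long long)matrixBytes(inner, cols, elem));
	std::printf("C (device -> host):    %llu bytes\n",
		(unsigned long long)matrixBytes(rows, cols, elem));
	std::printf("gemm work:             %llu flops\n",
		(unsigned long long)gemmFlops(rows, cols, inner));
}
```

Under that assumption B is about 2.9 KB to upload, C is about 2.1 MB to download, and the product is about 0.25 GFLOP, so a transfer of C that costs more than the gemm call is at least plausible.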

My video card (laptop) is an NVIDIA Quadro 3000M.

Does this make sense to you?

Zohar

zoharl · August 3, 2013, 8:35pm

// Compute C = B*A, where A is already on the device, B is on the host, and the result
// is copied back to the host buffer h_C. B is m x k, A is k x n, and C is m x n.
template<class T>
bool multMatrixOnDeviceRight(T* d_A, T* h_B, T* h_C, int m, int n, int k)
{
//Stopwatch::global.start_print();

	// Upload h_B to d_B
	T* d_B = NULL;
	cublasStatus blasStat = cublasAlloc(m*k, sizeof(T), (void**)&d_B);
	if ( blasStat != CUBLAS_STATUS_SUCCESS ) {
		cout << "GPU device memory allocation failed!" << endl;
		return false;
	}
	blasStat = cublasSetMatrix(m, k, sizeof(T), h_B, m, d_B, m);
	// check the transfer right away, before blasStat is overwritten
	if ( blasStat != CUBLAS_STATUS_SUCCESS ) {
		cout << "Host to GPU data transfer failed!" << endl;
		cublasFree(d_B);
		d_B = NULL;
		return false;
	}

	// Alloc d_C
	T* d_C = NULL;
	blasStat = cublasAlloc(m*n, sizeof(T), (void**)&d_C);
	if ( blasStat != CUBLAS_STATUS_SUCCESS ) {
		cout << "GPU device memory allocation failed!" << endl;
		cublasFree(d_B);	// don't leak d_B on this early return
		d_B = NULL;
		return false;
	}

	// mult matrices on GPU
	if ( sizeof(T) == 4 )
		cublasSgemm('N', 'N', m, n, k, 1.0f, (float*)d_B, m, (float*)d_A, k, 0.0f, (float*)d_C, m);
	else
		cublasDgemm('N', 'N', m, n, k, 1.0f, (double*)d_B, m, (double*)d_A, k, 0.0f, (double*)d_C, m);

	// download the result back to CPU
	blasStat = cublasGetMatrix(m, n, sizeof(T), d_C, m, h_C, m);
	if ( blasStat != CUBLAS_STATUS_SUCCESS )	{
		cout << "Data download from GPU to CPU failed!" << endl;
		cublasFree(d_B);	// release both device buffers on failure
		cublasFree(d_C);
		return false;
	}
	if ( d_B )	{
		cublasFree(d_B);
		d_B = NULL;
	}
	if ( d_C )	{
		cublasFree(d_C);
		d_C = NULL;
	}
//Stopwatch::global.print();
	return true;
}
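
For checking the GPU result (and for a CPU timing baseline), a host-side reference of the same product may help. This is only a minimal sketch: it assumes the column-major storage that cuBLAS expects, with leading dimensions m, k and m as in the gemm call above, and the names colMajorIndex and multMatrixOnHostRight are hypothetical.

```cpp
#include <cstddef>

// Offset of element (row, col) in a column-major matrix whose leading
// dimension is ld (ld >= number of rows).
constexpr std::size_t colMajorIndex(int row, int col, int ld)
{
	return static_cast<std::size_t>(row)
		+ static_cast<std::size_t>(col) * static_cast<std::size_t>(ld);
}

// Reference product C = B*A on the host, all matrices column-major:
// B is m x k (ld = m), A is k x n (ld = k), C is m x n (ld = m).
template<class T>
void multMatrixOnHostRight(const T* A, const T* B, T* C, int m, int n, int k)
{
	for (int j = 0; j < n; ++j) {
		for (int i = 0; i < m; ++i) {
			T sum = T(0);
			for (int p = 0; p < k; ++p)
				sum += B[colMajorIndex(i, p, m)] * A[colMajorIndex(p, j, k)];
			C[colMajorIndex(i, j, m)] = sum;
		}
	}
}
```

Comparing h_C from the GPU path against this output within a small tolerance is one way to confirm the argument order and leading dimensions before looking at timings.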

CudaaduC · August 3, 2013, 9:49pm

I think the Quadro is really a GPU intended for visualization, and it is limited to PCIe 2.0.

A more general (but related) question:

Is there an inherent advantage to using the ‘cublasAlloc()’ and related cublas* memory functions when compared to the more standard cudaMalloc() and cudaMemcpy() functions?

I generally use cudaMalloc() etc. and seem to get very fast memory transfer times (using PCIe 2.0 x8, ugh).

Are those cublas memory calls just wrappers or something more efficient?
