Alternatives to CUBLAS matrix multiplication (row-major)

Are there alternatives to CUBLAS that work with row-major data?

Using column-major is not always an option, as I may have kernels transforming the data that could be faster with row-major storage.

If you are really certain about the structure of your matrices, you can use the transpose options of gemv() and gemm() with row-major ordered data.
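For example, here is a minimal sketch using the legacy CUBLAS API from cublas.h (the function name RowMajorGemm is illustrative, and d_A, d_B, d_C are assumed to be device pointers already holding the data). The trick is that a row-major matrix read as column-major is its transpose, so passing 'T' for both operands undoes the row-major layout:

// Computes C = A x B where A (M x K) and B (K x N) are stored in row-major order.
// A row-major buffer read as column-major is the transpose, hence 'T' for both
// operands. Note: the result C (M x N) comes back in COLUMN-major order.
void RowMajorGemm(const float *d_A, const float *d_B, float *d_C, int M, int N, int K) {
	cublasSgemm('T', 'T',	// transpose both operands
		M, N, K,	// op(A) is M x K, op(B) is K x N
		1.0f,
		d_A, K,		// lda = K: the buffer reads as a K x M column-major matrix
		d_B, N,		// ldb = N: the buffer reads as an N x K column-major matrix
		0.0f,
		d_C, M);	// ldc = M: C is M x N, column-major
}

If you want the result in row-major order as well, swap the operands instead and compute C^T = B^T x A^T with no transposition at all: cublasSgemm('N', 'N', N, M, K, 1.0f, d_B, N, d_A, K, 0.0f, d_C, N);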

Thank you. Are there any performance differences between using the normal and transposed matrix versions?

Why don’t you try it, and then report back with the results?

:)

Best performance is achieved when both matrices are in column-major order.

Can you give plots? What’s the penalty for doing the transpose as a function of matrix size?

I did not make a plot, but I did create a small program that multiplies two matrices in all possible storage-order combinations (it requires GPUMLib, available at http://gpumlib.sourceforge.net/).

Here is the source code (it turns out I was wrong: the best order for the matrices is not always column-major, but depends on their size and perhaps on the device):

#include "../CudaInit.h"

#include "../../code/memory/DeviceMatrix.h"

#include <cublas.h>

#define ROWS_A (2000)

#define COLS_A (3000)

#define ROWS_B COLS_A

#define COLS_B (4000)

// Xc and Xr represent the same matrix in column-major order (Xc) and row-major order (Xr)

void RandomizeMatrix(HostMatrix<cudafloat> & Xc, HostMatrix<cudafloat> & Xr) {

	assert(Xc.Rows() == Xr.Rows() && Xc.Columns() == Xr.Columns());

	int rows = Xc.Rows();

	int columns = Xc.Columns();

	for(int r = 0; r < rows; r++) {

		for(int c = 0; c < columns; c++) {

			Xc(r, c) = Xr(r, c) = rand() / RAND_MAX;

		}

	}	

}

int main(int argc, char* argv[]) {

	CudaDevice device;

	if(!device.SupportsCuda()) {

		cout << "Device does not support cuda" << endl;

		return 0;

	}

	device.ShowInfo();

	cublasInit();

	HostMatrix<cudafloat> Ac(ROWS_A, COLS_A, ColumnMajor); // A column major

	HostMatrix<cudafloat> Ar(ROWS_A, COLS_A, RowMajor); // A row major

	RandomizeMatrix(Ac, Ar);

	HostMatrix<cudafloat> Bc(ROWS_B, COLS_B, ColumnMajor); // B column major

	HostMatrix<cudafloat> Br(ROWS_B, COLS_B, RowMajor); // B row major

	RandomizeMatrix(Bc, Br);

	DeviceMatrix<cudafloat> d_Ac(Ac);

	DeviceMatrix<cudafloat> d_Ar(Ar);

	DeviceMatrix<cudafloat> d_Bc(Bc);

	DeviceMatrix<cudafloat> d_Br(Br);

	

	DeviceMatrix<cudafloat> d_Result(ROWS_A, COLS_B, ColumnMajor);

	cout << endl << "Ac x Bc" << endl;

	size_t initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	int time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Bc" << endl;

	initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ac x Br" << endl;

	initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Br" << endl;

	initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cublasShutdown();

	return 0;

}

And here are the results on a GeForce 8600 GT (Ac = A in column-major, Ar = A in row-major, Bc = B in column-major, Br = B in row-major):

Ac x Bc: 2203 ms
Ar x Bc: 2268 ms
Ac x Br: 2155 ms
Ar x Br: 2257 ms

On the basis of those measurements, do you believe there is actually any difference between the four cases?

I changed the program to include several iterations:

#include "../CudaInit.h"

#include "../../code/memory/DeviceMatrix.h"

#include <cublas.h>

#define ROWS_A (2000)

#define COLS_A (3000)

#define ROWS_B COLS_A

#define COLS_B (4000)

#define ITERATIONS (100)

// Xc and Xr represent the same matrix in column-major order (Xc) and row-major order (Xr)

void RandomizeMatrix(HostMatrix<cudafloat> & Xc, HostMatrix<cudafloat> & Xr) {

	assert(Xc.Rows() == Xr.Rows() && Xc.Columns() == Xr.Columns());

	int rows = Xc.Rows();

	int columns = Xc.Columns();

	for(int r = 0; r < rows; r++) {

		for(int c = 0; c < columns; c++) {

			Xc(r, c) = Xr(r, c) = rand() / RAND_MAX;

		}

	}	

}

int main(int argc, char* argv[]) {

	CudaDevice device;

	if(!device.SupportsCuda()) {

		cout << "Device does not support cuda" << endl;

		return 0;

	}

	device.ShowInfo();

	cublasInit();

	HostMatrix<cudafloat> Ac(ROWS_A, COLS_A, ColumnMajor); // A column major

	HostMatrix<cudafloat> Ar(ROWS_A, COLS_A, RowMajor); // A row major

	RandomizeMatrix(Ac, Ar);

	HostMatrix<cudafloat> Bc(ROWS_B, COLS_B, ColumnMajor); // B column major

	HostMatrix<cudafloat> Br(ROWS_B, COLS_B, RowMajor); // B row major

	RandomizeMatrix(Bc, Br);

	DeviceMatrix<cudafloat> d_Ac(Ac);

	DeviceMatrix<cudafloat> d_Ar(Ar);

	DeviceMatrix<cudafloat> d_Bc(Bc);

	DeviceMatrix<cudafloat> d_Br(Br);

	

	DeviceMatrix<cudafloat> d_Result(ROWS_A, COLS_B, ColumnMajor);

	cout << endl << "Ac x Bc" << endl;

	size_t initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	int time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Bc" << endl;

	initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ac x Br" << endl;

	initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Br" << endl;

	initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cublasShutdown();

	return 0;

}

Here are the results for 100 iterations (GeForce 8600 GT):

Ac x Bc: 220650 ms
Ar x Bc: 227870 ms
Ac x Br: 215042 ms
Ar x Br: 224910 ms

These results do not allow me to draw any firm conclusions, but I can offer some advice:

  1. If you are just multiplying a few matrices, it probably does not matter whether you store them in column-major or row-major order (at least, the effort of testing which layout is faster may not pay off).

  2. If you need to multiply thousands or millions of matrices (for instance, in an iterative method), it is worth spending some time checking which storage order is best for each matrix.

  3. If you run additional kernels on the matrices, do not benchmark the multiplication in isolation: what you gain in the multiplication may not compensate for what you lose by choosing the wrong order for your kernels (see the kernel sketch after this list).

  4. If the size of your matrices varies (and usually it does), test the methods for different matrix sizes.
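To illustrate point 3 with a hypothetical kernel (AddBias is not part of GPUMLib, and it assumes a dense row-major buffer with no padding): a kernel's preferred layout is usually dictated by memory coalescing, i.e., consecutive threads should touch consecutive addresses. With row-major data, that means threads within a warp should vary along the columns:

// Adds a bias to every element of a row-major matrix. Threads within a warp
// differ in c, so a warp reads/writes consecutive addresses (coalesced).
__global__ void AddBias(float *X, int rows, int columns, float bias) {
	int r = blockIdx.y * blockDim.y + threadIdx.y;
	int c = blockIdx.x * blockDim.x + threadIdx.x;
	if (r < rows && c < columns) X[r * columns + c] += bias; // row-major indexing
}

With column-major data, the indexing would need to be X[c * rows + r] to stay coalesced; running the row-major version on column-major data (or vice-versa) can easily cost more than any difference between the gemm variants.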

The above program may help you find out which method is best for your problem, but I cannot say, for example, that storing the matrices in a given order will always be better (it depends on the problem).
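By the way, if you adapt the program, note that clock() measures host time at a fairly coarse resolution; CUDA events time the GPU work directly and report milliseconds. Here is a sketch of how one of the loops above could be timed with them (using the same variables as the program):

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	for(int i = 0; i < ITERATIONS; i++)
		DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Bc, d_Result); // uses CUBLAS
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop); // wait until all work up to 'stop' has finished

	float elapsedMs = 0.0f;
	cudaEventElapsedTime(&elapsedMs, start, stop); // time between the events, in ms
	cout << "Time: " << elapsedMs << " ms" << endl;

	cudaEventDestroy(start);
	cudaEventDestroy(stop);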

Hope this helps

At this point, I'm not even sure it's worthwhile to use the CUBLAS library for matrix multiplications. For smaller matrix sizes, I can use magmablas for better performance, and if I need streams to overlap data transfers with a matrix-multiplication kernel, I can just use Volkov's dgemm.
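For completeness, overlapping a host-to-device transfer with a kernel mostly amounts to issuing the two operations in different streams. A minimal hypothetical sketch (the scale kernel and buffer names are placeholders, and the host buffer must be page-locked, e.g. allocated with cudaMallocHost, for the copy to be truly asynchronous):

#include <cuda_runtime.h>

// Placeholder kernel: multiplies every element of x by s.
__global__ void scale(float *x, int n, float s) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) x[i] *= s;
}

void OverlapCopyAndCompute(float *d_current, float *d_next, const float *h_next, int n) {
	cudaStream_t copyStream, computeStream;
	cudaStreamCreate(&copyStream);
	cudaStreamCreate(&computeStream);

	// Upload the next batch while computing on the current one.
	cudaMemcpyAsync(d_next, h_next, n * sizeof(float), cudaMemcpyHostToDevice, copyStream);
	scale<<<(n + 255) / 256, 256, 0, computeStream>>>(d_current, n, 2.0f); // runs concurrently with the copy

	cudaStreamSynchronize(copyStream);
	cudaStreamSynchronize(computeStream);
	cudaStreamDestroy(copyStream);
	cudaStreamDestroy(computeStream);
}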