Alternatives to CUBLAS matrix multiplication (row-major)

Are there alternatives to CUBLAS that work with row-major data?

Using column-major is not always an option, as I may have kernels transforming the data that could be faster with row-major storage.

If you are really certain about the structure of your matrices, you can use the transpose options of gemv() and gemm() with row-major ordered data.
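For example, here is a minimal sketch using the legacy CUBLAS API from cublas.h (the function name RowMajorGemm is illustrative, and d_A, d_B, d_C are assumed to be device pointers already holding the data). The trick is that a row-major matrix read as column-major is its transpose, so passing 'T' for both operands undoes the row-major layout:

// Computes C = A x B where A (M x K) and B (K x N) are stored in row-major order.
// A row-major buffer read as column-major is the transpose, hence 'T' for both
// operands. Note: the result C (M x N) comes back in COLUMN-major order.
void RowMajorGemm(const float *d_A, const float *d_B, float *d_C, int M, int N, int K) {
	cublasSgemm('T', 'T',	// transpose both operands
		M, N, K,	// op(A) is M x K, op(B) is K x N
		1.0f,
		d_A, K,		// lda = K: the buffer reads as a K x M column-major matrix
		d_B, N,		// ldb = N: the buffer reads as an N x K column-major matrix
		0.0f,
		d_C, M);	// ldc = M: C is M x N, column-major
}

If you want the result in row-major order as well, swap the operands instead and compute C^T = B^T x A^T with no transposition at all: cublasSgemm('N', 'N', N, M, K, 1.0f, d_B, N, d_A, K, 0.0f, d_C, N);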

Thank you. Are there any performance differences between using the normal and transposed matrix versions?

Why don’t you try it, and then report back with the results?

:)

Best performance is achieved when both matrices are in column-major order.

Can you give plots? What’s the penalty for doing the transpose as a function of matrix size?

I did not make a plot, but I did create a small program that multiplies two matrices in all possible storage-order combinations (it requires GPUMLib, available at http://gpumlib.sourceforge.net/).

Here is the source code (it turns out I was wrong: the best order for the matrices is not always column-major, but depends on their size and perhaps on the device):

#include "../CudaInit.h"

#include "../../code/memory/DeviceMatrix.h"

#include <cublas.h>

#define ROWS_A (2000)

#define COLS_A (3000)

#define ROWS_B COLS_A

#define COLS_B (4000)

// Xc and Xr represent the same matrix in column-major order (Xc) and row-major order (Xr)

void RandomizeMatrix(HostMatrix<cudafloat> & Xc, HostMatrix<cudafloat> & Xr) {

	assert(Xc.Rows() == Xr.Rows() && Xc.Columns() == Xr.Columns());

	int rows = Xc.Rows();

	int columns = Xc.Columns();

	for(int r = 0; r < rows; r++) {

		for(int c = 0; c < columns; c++) {

			Xc(r, c) = Xr(r, c) = rand() / RAND_MAX;

		}

	}	

}

int main(int argc, char* argv[]) {

	CudaDevice device;

	if(!device.SupportsCuda()) {

		cout << "Device does not support cuda" << endl;

		return 0;

	}

	device.ShowInfo();

	cublasInit();

	HostMatrix<cudafloat> Ac(ROWS_A, COLS_A, ColumnMajor); // A column major

	HostMatrix<cudafloat> Ar(ROWS_A, COLS_A, RowMajor); // A row major

	RandomizeMatrix(Ac, Ar);

	HostMatrix<cudafloat> Bc(ROWS_B, COLS_B, ColumnMajor); // B column major

	HostMatrix<cudafloat> Br(ROWS_B, COLS_B, RowMajor); // B row major

	RandomizeMatrix(Bc, Br);

	DeviceMatrix<cudafloat> d_Ac(Ac);

	DeviceMatrix<cudafloat> d_Ar(Ar);

	DeviceMatrix<cudafloat> d_Bc(Bc);

	DeviceMatrix<cudafloat> d_Br(Br);

	

	DeviceMatrix<cudafloat> d_Result(ROWS_A, COLS_B, ColumnMajor);

	cout << endl << "Ac x Bc" << endl;

	size_t initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	int time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Bc" << endl;

	initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ac x Br" << endl;

	initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Br" << endl;

	initialTime = clock();

	DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cublasShutdown();

	return 0;

}

And here are the results on a GeForce 8600 GT (Ac = A in column-major, Ar = A in row-major, Bc = B in column-major, Br = B in row-major):

Ac x Bc: 2203 ms
Ar x Bc: 2268 ms
Ac x Br: 2155 ms
Ar x Br: 2257 ms

On the basis of those measurements, do you believe there is actually any difference between the four cases?

I changed the program to include several iterations:

#include "../CudaInit.h"

#include "../../code/memory/DeviceMatrix.h"

#include <cublas.h>

#define ROWS_A (2000)

#define COLS_A (3000)

#define ROWS_B COLS_A

#define COLS_B (4000)

#define ITERATIONS (100)

// Xc and Xr represent the same matrix in column-major order (Xc) and row-major order (Xr)

void RandomizeMatrix(HostMatrix<cudafloat> & Xc, HostMatrix<cudafloat> & Xr) {

	assert(Xc.Rows() == Xr.Rows() && Xc.Columns() == Xr.Columns());

	int rows = Xc.Rows();

	int columns = Xc.Columns();

	for(int r = 0; r < rows; r++) {

		for(int c = 0; c < columns; c++) {

			Xc(r, c) = Xr(r, c) = rand() / RAND_MAX;

		}

	}	

}

int main(int argc, char* argv[]) {

	CudaDevice device;

	if(!device.SupportsCuda()) {

		cout << "Device does not support cuda" << endl;

		return 0;

	}

	device.ShowInfo();

	cublasInit();

	HostMatrix<cudafloat> Ac(ROWS_A, COLS_A, ColumnMajor); // A column major

	HostMatrix<cudafloat> Ar(ROWS_A, COLS_A, RowMajor); // A row major

	RandomizeMatrix(Ac, Ar);

	HostMatrix<cudafloat> Bc(ROWS_B, COLS_B, ColumnMajor); // B column major

	HostMatrix<cudafloat> Br(ROWS_B, COLS_B, RowMajor); // B row major

	RandomizeMatrix(Bc, Br);

	DeviceMatrix<cudafloat> d_Ac(Ac);

	DeviceMatrix<cudafloat> d_Ar(Ar);

	DeviceMatrix<cudafloat> d_Bc(Bc);

	DeviceMatrix<cudafloat> d_Br(Br);

	

	DeviceMatrix<cudafloat> d_Result(ROWS_A, COLS_B, ColumnMajor);

	cout << endl << "Ac x Bc" << endl;

	size_t initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	int time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Bc" << endl;

	initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Bc, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ac x Br" << endl;

	initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cout << endl << "Ar x Br" << endl;

	initialTime = clock();

	for(int i = 0; i < ITERATIONS; i++) 

		DeviceMatrix<cudafloat>::Multiply(d_Ar, d_Br, d_Result); // uses CUBLAS

	cudaThreadSynchronize();

	time = (clock() - initialTime);

	cout << "Time: " << time << " ms" << endl;

	cublasShutdown();

	return 0;

}

Here are the results for 100 iterations (GeForce 8600 GT):

Ac x Bc: 220650 ms
Ar x Bc: 227870 ms
Ac x Br: 215042 ms
Ar x Br: 224910 ms

These results do not allow me to draw any firm conclusions, but I can offer some advice:

  1. If you are just multiplying a few matrices, it probably does not matter whether you store them in column-major or row-major order (at least, the effort of testing which layout is faster may not pay off).

  2. If you need to multiply thousands or millions of matrices (for instance, in an iterative method), it is worth spending some time checking which storage order is best for each matrix.

  3. If you run additional kernels on the matrices, do not benchmark the multiplication in isolation: what you gain in the multiplication may not compensate for what you lose by choosing the wrong order for your kernels (see the kernel sketch after this list).

  4. If the size of your matrices varies (and usually it does), test the methods for different matrix sizes.
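To illustrate point 3 with a hypothetical kernel (AddBias is not part of GPUMLib, and it assumes a dense row-major buffer with no padding): a kernel's preferred layout is usually dictated by memory coalescing, i.e., consecutive threads should touch consecutive addresses. With row-major data, that means threads within a warp should vary along the columns:

// Adds a bias to every element of a row-major matrix. Threads within a warp
// differ in c, so a warp reads/writes consecutive addresses (coalesced).
__global__ void AddBias(float *X, int rows, int columns, float bias) {
	int r = blockIdx.y * blockDim.y + threadIdx.y;
	int c = blockIdx.x * blockDim.x + threadIdx.x;
	if (r < rows && c < columns) X[r * columns + c] += bias; // row-major indexing
}

With column-major data, the indexing would need to be X[c * rows + r] to stay coalesced; running the row-major version on column-major data (or vice-versa) can easily cost more than any difference between the gemm variants.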

The above program may help you find out which method is best for your problem, but I cannot say, for example, that storing the matrices in a given order will always be better (it depends on the problem).
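By the way, if you adapt the program, note that clock() measures host time at a fairly coarse resolution; CUDA events time the GPU work directly and report milliseconds. Here is a sketch of how one of the loops above could be timed with them (using the same variables as the program):

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	for(int i = 0; i < ITERATIONS; i++)
		DeviceMatrix<cudafloat>::Multiply(d_Ac, d_Bc, d_Result); // uses CUBLAS
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop); // wait until all work up to 'stop' has finished

	float elapsedMs = 0.0f;
	cudaEventElapsedTime(&elapsedMs, start, stop); // time between the events, in ms
	cout << "Time: " << elapsedMs << " ms" << endl;

	cudaEventDestroy(start);
	cudaEventDestroy(stop);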

Hope this helps

At this point, I'm not even sure it's worthwhile to use the CUBLAS library for matrix multiplications. For smaller matrix sizes, I can use magmablas for better performance, and if I need streams to overlap data transfers with a matrix-multiplication kernel, I can just use Volkov's dgemm.
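For completeness, overlapping a host-to-device transfer with a kernel mostly amounts to issuing the two operations in different streams. A minimal hypothetical sketch (the scale kernel and buffer names are placeholders, and the host buffer must be page-locked, e.g. allocated with cudaMallocHost, for the copy to be truly asynchronous):

#include <cuda_runtime.h>

// Placeholder kernel: multiplies every element of x by s.
__global__ void scale(float *x, int n, float s) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) x[i] *= s;
}

void OverlapCopyAndCompute(float *d_current, float *d_next, const float *h_next, int n) {
	cudaStream_t copyStream, computeStream;
	cudaStreamCreate(&copyStream);
	cudaStreamCreate(&computeStream);

	// Upload the next batch while computing on the current one.
	cudaMemcpyAsync(d_next, h_next, n * sizeof(float), cudaMemcpyHostToDevice, copyStream);
	scale<<<(n + 255) / 256, 256, 0, computeStream>>>(d_current, n, 2.0f); // runs concurrently with the copy

	cudaStreamSynchronize(copyStream);
	cudaStreamSynchronize(computeStream);
	cudaStreamDestroy(copyStream);
	cudaStreamDestroy(computeStream);
}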