Advice - Complex Matrix-Vector Multiplication

jonface · May 11, 2009, 3:10pm

I’m new to CUDA but I have question regarding a kernel I’ve made for doing complex matrix-vector multiplication.

-The matrices and vectors are 1D arrays, stored as type float2, float2.x for the real part and float2.y for the imaginary.

-The matrices are 4x4, 8x8 or 16x16 and so the vectors of the appropriate sizes to conform.

The program works as follows.

Memory is allocated via cudaMalloc in device memory for the matrice array, vectors array and the vector result array.
The matrices are loaded into device global memory along with the vectors and a vector result array via cudamemcpy
The block size is 192 (seemed to give fastest results) and each block thread multiples a single matrix-vector.
For now multiples of 192 are used for the number of matrices/vectors so there is no need for checking out of bounds on the last block.

The kernel works as follows.

Parameters: *matrice pointer, *vectors pointer, int number of matrices, *vector result pointer.

calculate global matrix offset - blockIdx.x * blockDim.x * mRow * nCol + threadIdx.x * mRow*nCol
calculate global vector offset - blockIdx.x * blockDim.x * nCol + threadIdx.x*nCol
The vector is used for each row and so is copied into shared memory.
The global matrix is multiplied by the shared vector and the result stored in a global vector. The matrix is left in global memory as each element is only used one whereas the vector is used multiple times.
At the end of each row, the result is stored in the global vector.

Currently I’m only seeing about 2x the speedup compared to a naive CPU implementation. I’m using a GeForce 9500GT and a P4 3.2GHz.

The code is posted below, any advice on any mistakes I’ve made would be greatly appreciated. Even if there are existing projects doing this, I would appreciate feedback on my kernel.

Thanks for your time.

Note: The parameters passed to the kernel are just pointers which have been cast to int. Don’t ask, those were the specs. Oh and I’ve moved things around for easier reading in the post, don’t think I’ve broken anything. Thanks again.

typedef float2 Complex;

typedef unsigned int MatrixArrayID, VectorArrayID

#define make_complex(real,imag)(make_float2(real,imag))

static __device__ __host__ inline Complex ComplexAdd(Complex a, Complex b) {

	Complex c;

	c.x = a.x + b.x;

	c.y = a.y + b.y;

	return c;

}

static __device__ __host__ inline Complex ComplexMul(Complex a, Complex b) {

	Complex c;

	c.x = (a.x * b.x) - (a.y * b.y);

	c.y = (a.x * b.y) + (a.y * b.x);

	return c;

}

extern __shared__ Complex shared_data2[];

__global__ void MVKernel2(MatrixArrayID maid, VectorArrayID vaid,

		int mRow, int nCol, int numOfMatrices, int numOfVectors, VectorArrayID vaResultID) {

	// Casts for matrix, vector, result vector

	register Complex* matrix = (Complex*) maid;

	register Complex* vector = (Complex*) vaid;

	register Complex* vaResult = (Complex*) vaResultID;

	Complex* shared_vector = (Complex*) shared_data2;

	register Complex newVectorElement = make_complex(0, 0);

	register int matrixOffset = blockIdx.x * blockDim.x * mRow * nCol + threadIdx.x * mRow*nCol;

	register int vectorOffset = blockIdx.x * blockDim.x * nCol + threadIdx.x*nCol;

#ifdef __DEVICE_EMULATION__

	assert(matrixOffset < mRow * nCol * numOfMatrices);

	assert(vectorOffset < nCol * numOfMatrices);

#endif

	// Copy vector into shared mem

	#pragma unroll

	for (register int i = 0; i < nCol; i++) {

		shared_vector[i] = vector[vectorOffset + i];

	}

	// Multiply Matrix by Vector in shared mem storing the intermediate result

	// in register newVectorElement and then the final result in the global

	// vector array vaResult.

	#pragma unroll

	for (register int j = 0; j < mRow; j++) {

		newVectorElement = make_complex(0, 0);

	   #pragma unroll

		for (register int i = 0; i < nCol; i++) {

			register Complex mult = ComplexMul(matrix[matrixOffset + mRow * j + i], shared_vector[i]);

			newVectorElement= ComplexAdd(newVectorElement, mult);

		}

		//write result to global mem array

		vaResult[vectorOffset + j] = newVectorElement;

	}

}

jonface · May 12, 2009, 10:01am

I’ve unrolled the loading of the shared vector (below) and it’s quicker, although now it’s less portable, would only work for matrices/vectors Mx4/1x4 .

There must be something else I’m doing wrong? I’ve been reading about measuring GFLOPS and if I’m correct, I’m getting ~ 1 GFLOP. Does a floating point add count as 1 operation? Does a floating multiplication count as 1?

Thanks for your advice.

shared_vector[0] = vector[vectorOffset];

		shared_vector[1] = vector[vectorOffset + 1];

		shared_vector[2] = vector[vectorOffset + 2];

		shared_vector[3] = vector[vectorOffset + 3];

Jamie_K · May 12, 2009, 6:56pm

Accesses to the matrix are not coalesced. As you already discovered, your program is not compute-bound, but is instead IO bound.

Rearranging the computation to use coalesced access is a headache, especially since we usually want to focus on the algorithm and don’t want to have to deal with such harware details. It’s unfortunate, but it makes a huge difference.

jonface · May 12, 2009, 11:52pm

Thanks for the reply. Some guy over in #cuda on freenode explained this to me but I’m not entirely sure I get it, even after reading the programming guide External Image

I’ve got two arrays of float2, where sizeof(float2) is 8 bytes.
From what I understand, I should access each index sequentially by sequential threads?

So, looking at the array storing the matrix,

threadIdx.x accesses matrix array index
0 => 0
1 => 1
2 => 2
…
15 => 15

The block size is 64 , and the array index goes from [0, 1023], so I’ve got each thread loading (to shared) array indexes in the pattern: thread(1)[0,16,32…,1023], thread(2)[1,17,33,…,1022] etc…

Is this what I should be doing? If so I’ve got it loading into shared mem then the calculation will take place in that, finally be copied back to global mem in the same pattern. My idea was that if I understand this correctly, it should be faster to,

copy coalesced global to shared. Write to shared. Coalesced write shared to global.
than,
uncoalesced reads and writes to global.

I’ve seen way more complex things done with CUDA so I guess I’ve lost the plot and this should be easy? O_o

Thanks for your time.

Topic		Replies	Views
Vector matrix multiplication CUDA Programming and Performance	5	6216	November 30, 2011
complex multiplication & coalesced global memory access CUDA Programming and Performance	7	3251	January 2, 2012
Shared vs Global Memory impl. of vector matrix mulltiplication CUDA Programming and Performance	3	10761	February 8, 2008
Vector-Matrix Multiplication Is this a fast kernel? CUDA Programming and Performance	5	6780	April 19, 2010
Matrix - Vector Multiplication Can't get any faster with shared memory CUDA Programming and Performance	4	7257	September 6, 2011
Need Help Optimizing Kernel CUDA Programming and Performance	6	1174	January 7, 2014
Matrix multiplication CUDA CUDA Programming and Performance	7	3073	November 12, 2012
Uncoalesced on matrix by vector multiplication CUDA Programming and Performance	3	8071	June 24, 2009
Matrix by Vector multiplication CUDA Programming and Performance	8	1820	September 30, 2010
Checking Performance learning how to optimize CUDA codes CUDA Programming and Performance	4	2188	October 7, 2008

Advice - Complex Matrix-Vector Multiplication

Related topics