Advice - Complex Matrix-Vector Multiplication

I’m new to CUDA but I have question regarding a kernel I’ve made for doing complex matrix-vector multiplication.

-The matrices and vectors are 1D arrays, stored as type float2, float2.x for the real part and float2.y for the imaginary.

-The matrices are 4x4, 8x8 or 16x16 and so the vectors of the appropriate sizes to conform.

The program works as follows.

  • Memory is allocated via cudaMalloc in device memory for the matrice array, vectors array and the vector result array.

  • The matrices are loaded into device global memory along with the vectors and a vector result array via cudamemcpy

  • The block size is 192 (seemed to give fastest results) and each block thread multiples a single matrix-vector.

  • For now multiples of 192 are used for the number of matrices/vectors so there is no need for checking out of bounds on the last block.

The kernel works as follows.

  • Parameters: *matrice pointer, *vectors pointer, int number of matrices, *vector result pointer.
  1. calculate global matrix offset - blockIdx.x * blockDim.x * mRow * nCol + threadIdx.x * mRow*nCol

  2. calculate global vector offset - blockIdx.x * blockDim.x * nCol + threadIdx.x*nCol

  3. The vector is used for each row and so is copied into shared memory.

  4. The global matrix is multiplied by the shared vector and the result stored in a global vector. The matrix is left in global memory as each element is only used one whereas the vector is used multiple times.

  5. At the end of each row, the result is stored in the global vector.

Currently I’m only seeing about 2x the speedup compared to a naive CPU implementation. I’m using a GeForce 9500GT and a P4 3.2GHz.

The code is posted below, any advice on any mistakes I’ve made would be greatly appreciated. Even if there are existing projects doing this, I would appreciate feedback on my kernel.

Thanks for your time.

Note: The parameters passed to the kernel are just pointers which have been cast to int. Don’t ask, those were the specs. Oh and I’ve moved things around for easier reading in the post, don’t think I’ve broken anything. Thanks again.

typedef float2 Complex;

typedef unsigned int MatrixArrayID, VectorArrayID

#define make_complex(real,imag)(make_float2(real,imag))

static __device__ __host__ inline Complex ComplexAdd(Complex a, Complex b) {

	Complex c;

	c.x = a.x + b.x;

	c.y = a.y + b.y;

	return c;

}

static __device__ __host__ inline Complex ComplexMul(Complex a, Complex b) {

	Complex c;

	c.x = (a.x * b.x) - (a.y * b.y);

	c.y = (a.x * b.y) + (a.y * b.x);

	return c;

}

extern __shared__ Complex shared_data2[];

__global__ void MVKernel2(MatrixArrayID maid, VectorArrayID vaid,

		int mRow, int nCol, int numOfMatrices, int numOfVectors, VectorArrayID vaResultID) {

	// Casts for matrix, vector, result vector

	register Complex* matrix = (Complex*) maid;

	register Complex* vector = (Complex*) vaid;

	register Complex* vaResult = (Complex*) vaResultID;

	Complex* shared_vector = (Complex*) shared_data2;

	register Complex newVectorElement = make_complex(0, 0);

	register int matrixOffset = blockIdx.x * blockDim.x * mRow * nCol + threadIdx.x * mRow*nCol;

	register int vectorOffset = blockIdx.x * blockDim.x * nCol + threadIdx.x*nCol;

#ifdef __DEVICE_EMULATION__

	assert(matrixOffset < mRow * nCol * numOfMatrices);

	assert(vectorOffset < nCol * numOfMatrices);

#endif

	// Copy vector into shared mem

	#pragma unroll

	for (register int i = 0; i < nCol; i++) {

		shared_vector[i] = vector[vectorOffset + i];

	}

	// Multiply Matrix by Vector in shared mem storing the intermediate result

	// in register newVectorElement and then the final result in the global

	// vector array vaResult.

	#pragma unroll

	for (register int j = 0; j < mRow; j++) {

		newVectorElement = make_complex(0, 0);

	   #pragma unroll

		for (register int i = 0; i < nCol; i++) {

			register Complex mult = ComplexMul(matrix[matrixOffset + mRow * j + i], shared_vector[i]);

			newVectorElement= ComplexAdd(newVectorElement, mult);

		}

		//write result to global mem array

		vaResult[vectorOffset + j] = newVectorElement;

	}

}

I’ve unrolled the loading of the shared vector (below) and it’s quicker, although now it’s less portable, would only work for matrices/vectors Mx4/1x4 .

There must be something else I’m doing wrong? I’ve been reading about measuring GFLOPS and if I’m correct, I’m getting ~ 1 GFLOP. Does a floating point add count as 1 operation? Does a floating multiplication count as 1?

Thanks for your advice.

shared_vector[0] = vector[vectorOffset];

		shared_vector[1] = vector[vectorOffset + 1];

		shared_vector[2] = vector[vectorOffset + 2];

		shared_vector[3] = vector[vectorOffset + 3];

Accesses to the matrix are not coalesced. As you already discovered, your program is not compute-bound, but is instead IO bound.

Rearranging the computation to use coalesced access is a headache, especially since we usually want to focus on the algorithm and don’t want to have to deal with such harware details. It’s unfortunate, but it makes a huge difference.

Thanks for the reply. Some guy over in #cuda on freenode explained this to me but I’m not entirely sure I get it, even after reading the programming guide :/

  • I’ve got two arrays of float2, where sizeof(float2) is 8 bytes.
  • From what I understand, I should access each index sequentially by sequential threads?

So, looking at the array storing the matrix,

threadIdx.x accesses matrix array index
0 => 0
1 => 1
2 => 2

15 => 15

The block size is 64 , and the array index goes from [0, 1023], so I’ve got each thread loading (to shared) array indexes in the pattern: thread(1)[0,16,32…,1023], thread(2)[1,17,33,…,1022] etc…

Is this what I should be doing? If so I’ve got it loading into shared mem then the calculation will take place in that, finally be copied back to global mem in the same pattern. My idea was that if I understand this correctly, it should be faster to,

  • copy coalesced global to shared. Write to shared. Coalesced write shared to global.
    than,
  • uncoalesced reads and writes to global.

I’ve seen way more complex things done with CUDA so I guess I’ve lost the plot and this should be easy? O_o

Thanks for your time.