I’m new to CUDA but I have question regarding a kernel I’ve made for doing complex matrixvector multiplication.
The matrices and vectors are 1D arrays, stored as type float2, float2.x for the real part and float2.y for the imaginary.
The matrices are 4x4, 8x8 or 16x16 and so the vectors of the appropriate sizes to conform.
The program works as follows.

Memory is allocated via cudaMalloc in device memory for the matrice array, vectors array and the vector result array.

The matrices are loaded into device global memory along with the vectors and a vector result array via cudamemcpy

The block size is 192 (seemed to give fastest results) and each block thread multiples a single matrixvector.

For now multiples of 192 are used for the number of matrices/vectors so there is no need for checking out of bounds on the last block.
The kernel works as follows.
 Parameters: *matrice pointer, *vectors pointer, int number of matrices, *vector result pointer.

calculate global matrix offset  blockIdx.x * blockDim.x * mRow * nCol + threadIdx.x * mRow*nCol

calculate global vector offset  blockIdx.x * blockDim.x * nCol + threadIdx.x*nCol

The vector is used for each row and so is copied into shared memory.

The global matrix is multiplied by the shared vector and the result stored in a global vector. The matrix is left in global memory as each element is only used one whereas the vector is used multiple times.

At the end of each row, the result is stored in the global vector.
Currently I’m only seeing about 2x the speedup compared to a naive CPU implementation. I’m using a GeForce 9500GT and a P4 3.2GHz.
The code is posted below, any advice on any mistakes I’ve made would be greatly appreciated. Even if there are existing projects doing this, I would appreciate feedback on my kernel.
Thanks for your time.
Note: The parameters passed to the kernel are just pointers which have been cast to int. Don’t ask, those were the specs. Oh and I’ve moved things around for easier reading in the post, don’t think I’ve broken anything. Thanks again.
typedef float2 Complex;
typedef unsigned int MatrixArrayID, VectorArrayID
#define make_complex(real,imag)(make_float2(real,imag))
static __device__ __host__ inline Complex ComplexAdd(Complex a, Complex b) {
Complex c;
c.x = a.x + b.x;
c.y = a.y + b.y;
return c;
}
static __device__ __host__ inline Complex ComplexMul(Complex a, Complex b) {
Complex c;
c.x = (a.x * b.x)  (a.y * b.y);
c.y = (a.x * b.y) + (a.y * b.x);
return c;
}
extern __shared__ Complex shared_data2[];
__global__ void MVKernel2(MatrixArrayID maid, VectorArrayID vaid,
int mRow, int nCol, int numOfMatrices, int numOfVectors, VectorArrayID vaResultID) {
// Casts for matrix, vector, result vector
register Complex* matrix = (Complex*) maid;
register Complex* vector = (Complex*) vaid;
register Complex* vaResult = (Complex*) vaResultID;
Complex* shared_vector = (Complex*) shared_data2;
register Complex newVectorElement = make_complex(0, 0);
register int matrixOffset = blockIdx.x * blockDim.x * mRow * nCol + threadIdx.x * mRow*nCol;
register int vectorOffset = blockIdx.x * blockDim.x * nCol + threadIdx.x*nCol;
#ifdef __DEVICE_EMULATION__
assert(matrixOffset < mRow * nCol * numOfMatrices);
assert(vectorOffset < nCol * numOfMatrices);
#endif
// Copy vector into shared mem
#pragma unroll
for (register int i = 0; i < nCol; i++) {
shared_vector[i] = vector[vectorOffset + i];
}
// Multiply Matrix by Vector in shared mem storing the intermediate result
// in register newVectorElement and then the final result in the global
// vector array vaResult.
#pragma unroll
for (register int j = 0; j < mRow; j++) {
newVectorElement = make_complex(0, 0);
#pragma unroll
for (register int i = 0; i < nCol; i++) {
register Complex mult = ComplexMul(matrix[matrixOffset + mRow * j + i], shared_vector[i]);
newVectorElement= ComplexAdd(newVectorElement, mult);
}
//write result to global mem array
vaResult[vectorOffset + j] = newVectorElement;
}
}