In my code, I need to perform thousands/millions of symmetric matrix-vector multiplications. My matrices are quite small, usually not bigger than 30x30.
Both the matrices (stored as upper triangles) and the vectors are already in device memory.
I would like to perform these operations in parallel (that is, as many as possible simultaneously) and as fast as possible. The efficiency of my code depends on it.
Any suggestions?
Can I use the cuBLAS Sgemv routine? How? I have never used cuBLAS before.
You shouldn’t use CUBLAS for this, since it’s built for parallelizing over one large matrix (last time I checked).
I would recommend performing one M*a = b operation per block as a good start. To avoid the complications of a warp reduction, you might want to let each thread compute one element of the output vector ‘b’. Depending on whether you use row-major or column-major storage, also consider staging the matrix ‘M’ in shared memory for better memory coalescing.
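A minimal sketch of that idea, with several assumptions not stated in the thread: each matrix is stored as a full N x N row-major array (the symmetry already expanded, rather than the packed upper triangle the question mentions), N is at most 32, and the pointer names ‘mats’, ‘vecs’, ‘out’ are hypothetical device buffers holding all problems back to back. One block handles one matrix-vector product, one thread computes one output element, and the matrix and vector are staged in shared memory first.

```cpp
// Sketch only: one block per small matrix-vector product, one thread per output element.
// Assumes full row-major N x N storage and N <= MAX_N; 'mats', 'vecs', 'out' are
// hypothetical device pointers with all problems packed back to back.
#define MAX_N 32

__global__ void batched_symv(const float* __restrict__ mats,
                             const float* __restrict__ vecs,
                             float* __restrict__ out,
                             int n)
{
    __shared__ float sM[MAX_N * MAX_N];   // this block's matrix
    __shared__ float sx[MAX_N];           // this block's input vector

    const float* M = mats + (size_t)blockIdx.x * n * n;
    const float* x = vecs + (size_t)blockIdx.x * n;

    // Cooperatively stage the matrix and vector into shared memory.
    for (int i = threadIdx.x; i < n * n; i += blockDim.x)
        sM[i] = M[i];
    if (threadIdx.x < n)
        sx[threadIdx.x] = x[threadIdx.x];
    __syncthreads();

    // One thread per row: plain dot product, so no warp reduction is needed.
    int row = threadIdx.x;
    if (row < n) {
        float acc = 0.0f;
        for (int col = 0; col < n; ++col)
            acc += sM[row * n + col] * sx[col];
        out[(size_t)blockIdx.x * n + row] = acc;
    }
}

// Example launch: one block per matrix, block size >= n, e.g.
// batched_symv<<<numMatrices, 32>>>(d_mats, d_vecs, d_out, n);
```

With matrices this small, the matrix copy into shared memory may or may not pay off; it is mainly there so every thread reads its row without repeated global-memory traffic, and it is worth benchmarking against reading straight from global memory.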