In a project I am currently working on, I have a problem performing repeated matrix-vector multiplications.
I have a set of 10-30 vectors with a fixed size of typically 8192 entries, plus a single 4x4 matrix. The computational task is simply to multiply this matrix with each group of 4 consecutive entries of every vector (matrix times entries zero to three, matrix times entries four to seven, …) and store the resulting 4-entry vector back at the original position.
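To make the task concrete, here is a minimal CPU reference sketch of the operation for one vector. It assumes single-precision entries and a row-major matrix layout; the names `Mat4` and `apply_blockwise` are made up for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: a 4x4 matrix stored row-major (an assumption).
using Mat4 = std::array<float, 16>;

// Multiply every block of 4 consecutive entries of v by m and
// write the result back to the same position (in place).
void apply_blockwise(const Mat4& m, std::vector<float>& v) {
    assert(v.size() % 4 == 0);  // e.g. 8192 entries = 2048 blocks
    for (std::size_t base = 0; base < v.size(); base += 4) {
        // Copy the block first so the in-place write does not clobber inputs.
        const float x[4] = {v[base], v[base + 1], v[base + 2], v[base + 3]};
        for (int row = 0; row < 4; ++row) {
            float acc = 0.0f;
            for (int col = 0; col < 4; ++col) {
                acc += m[4 * row + col] * x[col];
            }
            v[base + row] = acc;
        }
    }
}
```

Each block of four entries is independent of all the others, so a vector of 8192 entries yields 2048 independent 4x4 matrix-vector products.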
I tried to do this by putting both the matrix and parts of the vectors in shared memory, but the performance is horrible because it leads to bank conflicts; even a CPU implementation is faster. Unfortunately, this is a post-processing step of a much larger computation that performs really well on a GPU, so I am forced to perform this step on the GPU as well (the entire update procedure is performed several thousand times).
Does anybody have an idea how this could be done efficiently?