How to implement extensive matrix-vector multiplications with small matrices but large vectors

Hello everyone,

in a project I am currently working on, I have a problem performing repeated matrix-vector multiplications.

I have a set (10-30) of vectors, each with a fixed size of typically 8192 entries. Furthermore, I have a 4x4 matrix. The computational task is simply to multiply this matrix with every group of 4 consecutive entries of the vectors (so matrix times entries zero to three, matrix times entries four to seven, …) and store the resulting vector at the original position.
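For reference, the task can be written as a plain CPU loop. This is a minimal sketch of my understanding of the problem; the function name apply_mat4 is illustrative, not from any actual code:

```c
#include <stddef.h>

/* Apply a row-major 4x4 matrix to every group of 4 consecutive
 * entries of a vector of length n (n must be a multiple of 4),
 * storing the result back in place. */
void apply_mat4(const float mat[16], float *vec, size_t n)
{
    for (size_t base = 0; base < n; base += 4) {
        float y[4];
        for (int row = 0; row < 4; row++) {
            y[row] = 0.0f;
            for (int col = 0; col < 4; col++)
                y[row] += mat[4 * row + col] * vec[base + col];
        }
        for (int row = 0; row < 4; row++)
            vec[base + row] = y[row];
    }
}
```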

I tried to do this by putting both the matrix and parts of the vectors in shared memory, but the performance is horrible since it leads to bank conflicts; even a CPU implementation is faster. Unfortunately, this is a post-processing step of a much larger computation which performs really well on a GPU, so I am forced to perform this step on the GPU as well (the entire update procedure is performed several thousand times).

Does anybody have an idea how this could be done efficiently?


Bank conflicts can be avoided by padding.
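As an illustration of the padding idea, here is a sketch in the style of the classic CUDA SDK transpose sample (simplified to a square matrix and a full TILE x TILE thread block; not the original poster's kernel). Adding one extra column to the shared tile shifts consecutive rows to different banks, so column-wise reads become conflict-free:

```cuda
#define TILE 32

__global__ void transposeCoalesced(float *odata, const float *idata)
{
    // +1 padding column: without it, tile[threadIdx.x][threadIdx.y]
    // would hit the same bank for all threads of a warp
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int width = gridDim.x * TILE;

    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];
    __syncthreads();

    // transposed block offset, coalesced write
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    odata[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```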

Your app is memory-bound, same as BLAS2.

What is your configuration, including (grid, blocks) and occupancy? Which graphics card is your app running on?
Do you reach peak bandwidth?

Hi and thank you for the prompt reply.

Yes, I thought the problem was memory-bound. Nevertheless, the performance is really bad: doing the same thing with one CPU core (and a tenth of the bandwidth) is faster.

The device is a GTX 470 card.

The resource usage is: 15 registers, 2736+16 bytes smem, 4 bytes cmem[1].

The block size is 128 (32 subvectors of 4 entries each). The number of blocks then depends on the total size of the vectors.

I guess I am not reaching maximal bandwidth, since a CPU implementation is faster (not counting the transfer from and to the GPU).


A GTX 470 can reach about 140 GB/s of bandwidth. With that resource usage you get 1024 threads per SM, which is very good occupancy.

Do you compile with -arch=sm_20?

Could you post pseudocode? I wonder whether the coalesced access property is kept or not.

I tried both sm_10 and sm_20 and couldn't notice any difference in execution speed.

const int i = blockDim.x * blockIdx.x + threadIdx.x;
const int tid = threadIdx.x;
const int RowIndex = i % 4;

// VecSize and MatSize are known during compilation
__shared__ float vector[VecSize];
__shared__ float Mat[MatSize];

float val = 0.0f;

// fetch data into shared memory
vector[tid] = some_ptr[i];

// fetch Mat (the first 16 threads load the 4x4 matrix)
if (tid < 16)
    Mat[tid] = g_Mat[tid];

__syncthreads();

// multiply one matrix row with this thread's 4-entry subvector
for (int It = 0; It < 4; It++)
    val += Mat[4 * RowIndex + It] * vector[tid - RowIndex + It];

some_ptr[i] = val;

There is no need for shared memory; a 4x4 matrix fits nicely into registers. Use float4 variables to efficiently get 4 consecutive floats into each thread's registers. This should work well on the GPU.
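A minimal sketch of that idea, with the 4x4 matrix kept in constant memory so no shared memory is needed at all (reads of the same constant address are broadcast to the whole warp). The names c_Mat and mat4_inplace are illustrative, not from this thread:

```cuda
__constant__ float c_Mat[16];  // row-major 4x4 matrix, set via cudaMemcpyToSymbol

__global__ void mat4_inplace(float4 *vec)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;

    // one 128-bit load: 4 consecutive entries into registers
    float4 x = vec[i];

    float4 y;
    y.x = c_Mat[0]  * x.x + c_Mat[1]  * x.y + c_Mat[2]  * x.z + c_Mat[3]  * x.w;
    y.y = c_Mat[4]  * x.x + c_Mat[5]  * x.y + c_Mat[6]  * x.z + c_Mat[7]  * x.w;
    y.z = c_Mat[8]  * x.x + c_Mat[9]  * x.y + c_Mat[10] * x.z + c_Mat[11] * x.w;
    y.w = c_Mat[12] * x.x + c_Mat[13] * x.y + c_Mat[14] * x.z + c_Mat[15] * x.w;

    vec[i] = y;  // one 128-bit store back to the original position
}
```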

You can use 128-bit loads; then each thread computes 4 consecutive elements.

Bank conflicts disappear because of broadcast, and you don't need to move the vector into shared memory.

Suppose X is a matrix containing the "set (10-30) of vectors with a fixed size of typically 8192 entries".

X is column-major with dimension 8192 x width, where width is 10-30. Also, ldx = 8192 is the leading dimension of X.

dim3 block(128,1);
dim3 grid(8192/(128*4), width);
foo<<<grid, block>>>( (float4 *)X, g_Mat, ldx/4 );

/* g_Mat[4][4] is row-major */
__global__ void foo( float4 *some_ptr, float *g_Mat, int ldx )
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    const int tid = threadIdx.x;

    __shared__ float Mat[16]; // row-major 4x4 matrix

    // fetch Mat (the first 16 threads load the matrix)
    if (tid < 16) {
        Mat[tid] = g_Mat[tid];
    }
    __syncthreads();

    // 128-bit load: 4 consecutive elements per thread
    float4 xreg = some_ptr[i + ldx * blockIdx.y];
    float  y[4];
    float *As = Mat;

    #pragma unroll
    for (int j = 0; j < 4; j++) {
        y[j] = As[0] * xreg.x;
        y[j] = As[1] * xreg.y + y[j];
        y[j] = As[2] * xreg.z + y[j];
        y[j] = As[3] * xreg.w + y[j];
        As += 4;
    }

    xreg.x = y[0];
    xreg.y = y[1];
    xreg.z = y[2];
    xreg.w = y[3];

    // 128-bit store back to the original position
    some_ptr[i + ldx * blockIdx.y] = xreg;
}


Thank you for the suggested code. I will try it in the next days (I am on a business trip right now) and report how much it improves the speed.