How to implement extensive matrix-vector multiplications with small matrices but large vectors

Hello everyone,

in a project I am currently working on, I have a problem performing repeated matrix-vector multiplications.

I have a set (10-30) of vectors, each with a fixed size of typically 8192 entries. Furthermore, I have a 4x4 matrix. The computational task is simply to multiply this matrix with every group of 4 consecutive entries of the vectors (so matrix times entries zero to three, matrix times entries four to seven, …) and store the resulting vector at the original position.
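For reference, the task can be written as a plain CPU loop. This is a minimal sketch of my understanding of the problem; the function name apply_mat4 is illustrative, not from any actual code:

```c
#include <stddef.h>

/* Apply a row-major 4x4 matrix to every group of 4 consecutive
 * entries of a vector of length n (n must be a multiple of 4),
 * storing the result back in place. */
void apply_mat4(const float mat[16], float *vec, size_t n)
{
    for (size_t base = 0; base < n; base += 4) {
        float y[4];
        for (int row = 0; row < 4; row++) {
            y[row] = 0.0f;
            for (int col = 0; col < 4; col++)
                y[row] += mat[4 * row + col] * vec[base + col];
        }
        for (int row = 0; row < 4; row++)
            vec[base + row] = y[row];
    }
}
```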

I tried to do this by putting both the matrix and parts of the vectors in shared memory, but the performance is horrible since it leads to bank conflicts; even a CPU implementation is faster. Unfortunately, this is a post-processing step of a much larger computation which performs really well on a GPU, so I am forced to perform this step on the GPU as well (the entire update procedure is performed several thousand times).

Does anybody have an idea how this could be done efficiently?


Bank conflicts can be avoided by padding.
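As an illustration of the padding idea, here is a sketch in the style of the classic CUDA SDK transpose sample (simplified to a square matrix and a full TILE x TILE thread block; not the original poster's kernel). Adding one extra column to the shared tile shifts consecutive rows to different banks, so column-wise reads become conflict-free:

```cuda
#define TILE 32

__global__ void transposeCoalesced(float *odata, const float *idata)
{
    // +1 padding column: without it, tile[threadIdx.x][threadIdx.y]
    // would hit the same bank for all threads of a warp
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int width = gridDim.x * TILE;

    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];
    __syncthreads();

    // transposed block offset, coalesced write
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    odata[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```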

Your app is memory-bound, same as BLAS2.

What is your configuration, including (grid, blocks) and occupancy? Which graphics card is your app running on?
Do you reach peak bandwidth?

Hi and thank you for the prompt reply.

Yes, I thought the problem was memory-bound. Nevertheless, the performance is really bad: doing the same thing with one CPU core (and a tenth of the bandwidth) is faster.

The device is a GTX 470 card.

The resource usage is: 15 registers, 2736+16 bytes smem, 4 bytes cmem[1].

The block size is 128 (32 subvectors of 4 entries each). The number of blocks then depends on the total size of the vectors.

I guess I am not reaching maximal bandwidth, since a CPU implementation is faster (not counting the transfer from and to the GPU).


A GTX 470 can reach about 140 GB/s of bandwidth. With that resource usage you get 1024 threads per SM, which is very good occupancy.

Do you compile with -arch=sm_20?

Could you post pseudocode? I wonder whether the coalesced access property is kept or not.

I tried both sm_10 and sm_20 and couldn't notice any difference in execution speed.

const int i = blockDim.x * blockIdx.x + threadIdx.x;
const int tid = threadIdx.x;
const int RowIndex = i % 4;

// VecSize and MatSize are known during compilation
__shared__ float vector[VecSize];
__shared__ float Mat[MatSize];

float val = 0.0f;

// fetch data into shared memory
vector[tid] = some_ptr[i];

// fetch Mat (the first 16 threads load the 4x4 matrix)
if (tid < 16)
    Mat[tid] = g_Mat[tid];

__syncthreads();

// multiply one matrix row with this thread's 4-entry subvector
for (int It = 0; It < 4; It++)
    val += Mat[4 * RowIndex + It] * vector[tid - RowIndex + It];

some_ptr[i] = val;

There is no need for shared memory; a 4x4 matrix fits nicely into registers. Use float4 variables to efficiently get 4 consecutive floats into each thread's registers. This should work well on the GPU.
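A minimal sketch of that idea, with the 4x4 matrix kept in constant memory so no shared memory is needed at all (reads of the same constant address are broadcast to the whole warp). The names c_Mat and mat4_inplace are illustrative, not from this thread:

```cuda
__constant__ float c_Mat[16];  // row-major 4x4 matrix, set via cudaMemcpyToSymbol

__global__ void mat4_inplace(float4 *vec)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;

    // one 128-bit load: 4 consecutive entries into registers
    float4 x = vec[i];

    float4 y;
    y.x = c_Mat[0]  * x.x + c_Mat[1]  * x.y + c_Mat[2]  * x.z + c_Mat[3]  * x.w;
    y.y = c_Mat[4]  * x.x + c_Mat[5]  * x.y + c_Mat[6]  * x.z + c_Mat[7]  * x.w;
    y.z = c_Mat[8]  * x.x + c_Mat[9]  * x.y + c_Mat[10] * x.z + c_Mat[11] * x.w;
    y.w = c_Mat[12] * x.x + c_Mat[13] * x.y + c_Mat[14] * x.z + c_Mat[15] * x.w;

    vec[i] = y;  // one 128-bit store back to the original position
}
```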

You can use 128-bit loads; then each thread computes 4 consecutive elements.

Bank conflicts disappear because of broadcast, and you don't need to move the vector into shared memory.

Suppose X is a matrix containing the "set (10-30) of vectors with a fixed size of typically 8192 entries".

X is column-major with dimension 8192 x width, where width is 10-30. Also, ldx = 8192 is the leading dimension of X.

dim3 block(128,1);
dim3 grid(8192/(128*4), width);
foo<<<grid, block>>>( (float4 *)X, g_Mat, ldx/4 );

/* g_Mat[4][4] is row-major */
__global__ void foo( float4 *some_ptr, float *g_Mat, int ldx )
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    const int tid = threadIdx.x;

    __shared__ float Mat[16]; // row-major 4x4 matrix

    // fetch Mat (the first 16 threads load the matrix)
    if (tid < 16) {
        Mat[tid] = g_Mat[tid];
    }
    __syncthreads();

    // 128-bit load: 4 consecutive elements per thread
    float4 xreg = some_ptr[i + ldx * blockIdx.y];
    float  y[4];
    float *As = Mat;

    #pragma unroll
    for (int j = 0; j < 4; j++) {
        y[j] = As[0] * xreg.x;
        y[j] = As[1] * xreg.y + y[j];
        y[j] = As[2] * xreg.z + y[j];
        y[j] = As[3] * xreg.w + y[j];
        As += 4;
    }

    xreg.x = y[0];
    xreg.y = y[1];
    xreg.z = y[2];
    xreg.w = y[3];

    // 128-bit store back to the original position
    some_ptr[i + ldx * blockIdx.y] = xreg;
}


Thank you for the suggested code. I will try it in the next days (I am on a business trip right now) and report how much it improves the speed.