I have been implementing a vector-matrix multiplication, c = b*A, in CUDA. Essentially I am adapting the simple SGEMM code from the SDK example. As a first step I worked with global memory only. The code is very simple:

```
// Starting offset of this block's columns in A (A is stored column by column).
int bBegin = __mul24(__mul24(blx, block_dim_x), HEIGHT_A);
float Csub = 0;
// Walk down b (and the matching rows of A) in steps of BLOCK_DIM_Y.
for (int a = 0, b = bBegin; a < HEIGHT_A; a += BLOCK_DIM_Y, b += BLOCK_DIM_Y)
{
    for (int i = 0; i < BLOCK_DIM_Y; i++)
        Csub += global_b[a + i] * global_A[b + i + __mul24(thx, HEIGHT_A)];
    __syncthreads();
}
// Each thread writes one element of the result vector c.
int c = __mul24(blx, block_dim_x);
global_c[c + thx] = Csub;
```

The program works fine as long as the matrix sizes are multiples of the block dimensions, and it runs in about 420 ms. I expected that extending the code to use shared memory would significantly improve performance, but the opposite happened: the time went up to 550 ms.

```
// Starting offset of this block's columns in A (A is stored column by column).
int bBegin = __mul24(__mul24(blx, block_dim_x), HEIGHT_A);
float Csub = 0;
for (int a = 0, b = bBegin; a < HEIGHT_A; a += BLOCK_DIM_Y, b += BLOCK_DIM_Y)
{
    // Stage the current tile of b in shared memory, one element per thread.
    __shared__ float shared_b[BLOCK_DIM_Y];
    if (thx < BLOCK_DIM_Y)
        shared_b[thx] = global_b[a + thx];
    __syncthreads();
    for (int i = 0; i < BLOCK_DIM_Y; i++) {
        Csub += shared_b[i] * global_A[b + i + __mul24(thx, HEIGHT_A)];
    }
    __syncthreads();
}
// Each thread writes one element of the result vector c.
int c = __mul24(blx, block_dim_x);
global_c[c + thx] = Csub;
```

I tried increasing the block dimension in Y, but nothing really helps. What is wrong with my code? Are there bank conflicts? Is my vector-matrix multiplication implemented incorrectly?

thanks in advance