I am working on a matrix-vector multiplication problem and trying to compare shared memory with global memory. The problem is y = A*x, where A is an m×n matrix and x is an n×1 vector.
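To make the data layout explicit, here is a minimal host-side reference for the same product. It is only a sketch: the name `mv_reference` and the use of `std::vector` are illustrative choices, and it assumes A is stored row-major, which is what the `Ad[idx*n+j]` indexing in the kernels implies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y = A * x (hypothetical helper, not part of the kernels).
// Assumption: A is row-major, so element (i, j) lives at A[i * n + j].
std::vector<float> mv_reference(const std::vector<float>& A,
                                const std::vector<float>& x,
                                std::size_t m, std::size_t n)
{
    assert(A.size() == m * n && x.size() == n);
    std::vector<float> y(m, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {      // one output element per row
        float sum = 0.0f;
        for (std::size_t j = 0; j < n; ++j)    // dot product of row i with x
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
    return y;
}
```

Each GPU thread in the kernels plays the role of one iteration of the outer loop, so a result like this can be used to check the output of either kernel.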

For the kernel that uses only global memory, I have:

    __global__ void mv_kernel(float *Yd, float *Ad, float *xd, int m, int n)
    {
        float sum = 0;
        // one thread per row of A
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx >= m) return;
        for (int j = 0; j < n; j++)
            sum += Ad[idx * n + j] * xd[j];

        Yd[idx] = sum;
    }

And for the kernel that uses shared memory, I have:

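    __global__ void mv_kernel(float *Yd, float *Ad, float *xd, int m, int n)
    {
        float sum = 0;
        __shared__ float xds[512];
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx >= m) return;
        for (int i = 0; i < (n - 1) / 512 + 1; i++) {
            // each thread copies one element of the current chunk of xd
            // (the chunk is filled completely only if blockDim.x is 512)
            if ((threadIdx.x + i * 512) < n)
                xds[threadIdx.x] = xd[threadIdx.x + i * 512];

            __syncthreads();

            // accumulate this chunk's contribution to row idx
            for (int j = 0; j < 512; j++)
                if ((i * 512 + j) < n)
                    sum += Ad[idx * n + j + i * 512] * xds[j];

            __syncthreads();
        }

        Yd[idx] = sum;
    }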

In the shared memory version, xd is copied into shared memory in chunks of 512 elements so that it does not exceed the shared memory limit.
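The chunk arithmetic can be checked on the host. The following is a rough sketch, not the kernel itself: `TILE`, `num_tiles`, and `row_dot_tiled` are hypothetical names, and the local array only stands in for the `__shared__` buffer. It uses `(n + TILE - 1) / TILE` for the chunk count, which equals `(n-1)/512+1` for n >= 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 512;  // same size as the xds buffer in the kernel

// Number of chunks needed to cover n elements, i.e. ceil(n / TILE).
std::size_t num_tiles(std::size_t n)
{
    return (n + TILE - 1) / TILE;
}

// Dot product of one row of A with x, walked chunk by chunk like the kernel.
float row_dot_tiled(const std::vector<float>& row, const std::vector<float>& x)
{
    assert(row.size() == x.size());
    const std::size_t n = x.size();
    float sum = 0.0f;
    for (std::size_t i = 0; i < num_tiles(n); ++i) {
        float xds[TILE];  // stand-in for the shared memory buffer
        // stage the current chunk of x, guarding the partial last chunk
        for (std::size_t t = 0; t < TILE; ++t)
            if (i * TILE + t < n)
                xds[t] = x[i * TILE + t];
        // accumulate this chunk's contribution with the same guard
        for (std::size_t j = 0; j < TILE; ++j)
            if (i * TILE + j < n)
                sum += row[i * TILE + j] * xds[j];
    }
    return sum;
}
```

For example, n = 513 needs two chunks, and the second chunk holds only one valid element, which is why both the load and the accumulation carry a bounds check.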

I expected the shared memory version to be faster than the global memory version. However, when I run both kernels, the global memory version is always slightly faster than the shared memory version.

Am I doing something wrong?