shared memory problem

I am working on a matrix-vector multiplication problem and am trying to compare shared memory with global memory. The problem is y = Ax, where A is an m*n matrix and x is an n*1 vector.
For the kernel that uses global memory, I have:

__global__ void mv_kernel(float *Yd, float *Ad, float *xd, int m, int n)
{
    float sum = 0;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= m) return;
    for (int j = 0; j < n; j++)
        sum += Ad[idx * n + j] * xd[j];
    Yd[idx] = sum;
}

And for the kernel program implemented by shared memory, I have:

__global__ void mv_kernel(float *Yd, float *Ad, float *xd, int m, int n)
{
    float sum = 0;
    __shared__ float xds[512];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= m) return;
    for (int i = 0; i < (n - 1) / 512 + 1; i++) {
        if ((threadIdx.x + i * 512) < n)
            xds[threadIdx.x] = xd[threadIdx.x + i * 512];
        __syncthreads();
        for (int j = 0; j < 512; j++)
            if (i * 512 + j < n)
                sum += Ad[idx * n + j + i * 512] * xds[j];
        __syncthreads();
    }
    Yd[idx] = sum;
}

In the shared memory example, xd is copied into shared memory in chunks of 512 elements so that it does not exceed the shared memory limit.
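
For what it is worth, here is a minimal sketch of the same chunked scheme with one change: threads with idx >= m do not return early. In the kernel above those threads leave before __syncthreads(), so they never load their element of xds, and the rest of the block can read uninitialised values when m is not a multiple of the block size. The names mv_kernel_tiled, mv_tiled and TILE are made up for this sketch, and it assumes the block size equals TILE (512).

```cuda
#include <cuda_runtime.h>

#define TILE 512  // chunk of x held in shared memory; sketch assumes blockDim.x == TILE

// y = A*x with x staged chunk by chunk in shared memory.
// Every thread of the block takes part in the load and reaches both barriers.
__global__ void mv_kernel_tiled(float *Yd, const float *Ad, const float *xd,
                                int m, int n)
{
    __shared__ float xds[TILE];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0.0f;

    for (int tile = 0; tile < (n + TILE - 1) / TILE; tile++) {
        int col = tile * TILE + threadIdx.x;
        if (col < n)
            xds[threadIdx.x] = xd[col];
        __syncthreads();

        if (idx < m) {
            // Number of valid entries in this chunk (the last one may be short).
            int limit = min(TILE, n - tile * TILE);
            for (int j = 0; j < limit; j++)
                sum += Ad[idx * n + tile * TILE + j] * xds[j];
        }
        __syncthreads();  // do not overwrite xds while others still read it
    }
    if (idx < m)
        Yd[idx] = sum;
}

// Hypothetical host helper (not from the post): copy in, launch, copy out.
cudaError_t mv_tiled(const float *A, const float *x, float *y, int m, int n)
{
    float *Ad = nullptr, *xd = nullptr, *Yd = nullptr;
    cudaMalloc(&Ad, sizeof(float) * m * n);
    cudaMalloc(&xd, sizeof(float) * n);
    cudaMalloc(&Yd, sizeof(float) * m);
    cudaMemcpy(Ad, A, sizeof(float) * m * n, cudaMemcpyHostToDevice);
    cudaMemcpy(xd, x, sizeof(float) * n, cudaMemcpyHostToDevice);

    mv_kernel_tiled<<<(m + TILE - 1) / TILE, TILE>>>(Yd, Ad, xd, m, n);

    cudaError_t err = cudaGetLastError();  // launch configuration errors
    if (err == cudaSuccess)                // blocking copy waits for the kernel
        err = cudaMemcpy(y, Yd, sizeof(float) * m, cudaMemcpyDeviceToHost);
    cudaFree(Ad); cudaFree(xd); cudaFree(Yd);
    return err;
}
```

This only makes the barrier usage safe; it does not change how Ad is read, so it should not be expected to run faster than the global memory version.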

I expected the shared memory version to be faster than the global memory version. However, when I run both programs, the global memory version is always slightly faster than the shared memory version.
Am I doing something wrong?

Coalescing accesses to Ad would be more important than coalescing reads of xd.
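
To illustrate the point: in both kernels above, neighbouring threads read Ad[idx * n + j], so at each step their addresses are n floats apart and the loads are not coalesced, whereas every thread reads the same xd[j]. One possible way to coalesce the Ad reads, sketched below, is to give each row to a whole warp so that the 32 lanes read 32 consecutive floats. This is only a sketch under assumptions: the names mv_kernel_warp_per_row and mv_warp_per_row are invented here, the block size must be a multiple of 32, and __shfl_down_sync needs CUDA 9 or later.

```cuda
#include <cuda_runtime.h>

// One warp per row of A. Lane k reads Ad[row*n + k], Ad[row*n + k + 32], ...
// so each load instruction of a warp touches 32 consecutive floats.
// Assumes a 1D block whose size is a multiple of 32.
__global__ void mv_kernel_warp_per_row(float *Yd, const float *Ad,
                                       const float *xd, int m, int n)
{
    int lane = threadIdx.x % 32;
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (row >= m) return;  // row is the same for all lanes, so whole warps exit

    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)
        sum += Ad[row * n + j] * xd[j];

    // Add up the 32 partial sums inside the warp; lane 0 ends with the total.
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0)
        Yd[row] = sum;
}

// Hypothetical host helper (not from the thread): copy in, launch, copy out.
cudaError_t mv_warp_per_row(const float *A, const float *x, float *y,
                            int m, int n)
{
    const int threads = 256;                          // 8 warps = 8 rows per block
    const int blocks  = (m * 32 + threads - 1) / threads;

    float *Ad = nullptr, *xd = nullptr, *Yd = nullptr;
    cudaMalloc(&Ad, sizeof(float) * m * n);
    cudaMalloc(&xd, sizeof(float) * n);
    cudaMalloc(&Yd, sizeof(float) * m);
    cudaMemcpy(Ad, A, sizeof(float) * m * n, cudaMemcpyHostToDevice);
    cudaMemcpy(xd, x, sizeof(float) * n, cudaMemcpyHostToDevice);

    mv_kernel_warp_per_row<<<blocks, threads>>>(Yd, Ad, xd, m, n);

    cudaError_t err = cudaGetLastError();  // launch configuration errors
    if (err == cudaSuccess)                // blocking copy waits for the kernel
        err = cudaMemcpy(y, Yd, sizeof(float) * m, cudaMemcpyDeviceToHost);
    cudaFree(Ad); cudaFree(xd); cudaFree(Yd);
    return err;
}
```

Whether this beats the original kernels depends on m, n and the GPU, so it is worth timing on your own sizes rather than taking it on trust.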

Kindly stop creating duplicate threads, particularly when your concern has already been addressed by another member. Please continue to use this thread for your issue.