shared memory problem

ultralight · April 21, 2010, 1:34am

I am working on a matrix-vector multiplication problem and trying to compare shared memory with global memory. The problem is y = Ax, where A is an m×n matrix and x is an n×1 vector.
For the kernel implemented with global memory, I have:

__global__ void mv_kernel(float *Yd, float *Ad, float *xd, int m, int n)
{
    float sum = 0;
    // One thread per row of A (row-major storage).
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= m) return;
    for (int j = 0; j < n; j++)
        sum += Ad[idx * n + j] * xd[j];
    Yd[idx] = sum;
}

And for the kernel implemented with shared memory, I have:

__global__ void mv_kernel(float *Yd, float *Ad, float *xd, int m, int n)
{
    float sum = 0;
    __shared__ float xds[512];  // one chunk of xd; assumes blockDim.x == 512
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // No early return: every thread in the block must reach __syncthreads().
    for (int i = 0; i < (n - 1) / 512 + 1; i++) {
        if (threadIdx.x + i * 512 < n)
            xds[threadIdx.x] = xd[threadIdx.x + i * 512];
        __syncthreads();
        if (idx < m)
            for (int j = 0; j < 512; j++)
                if (i * 512 + j < n)
                    sum += Ad[idx * n + j + i * 512] * xds[j];
        __syncthreads();
    }
    if (idx < m) Yd[idx] = sum;
}
In the shared memory version, xd is loaded into shared memory in chunks of 512 elements to stay within the shared memory limit.

I expected the shared memory version to be faster than the global memory version. However, when I run both, the global memory version is always slightly faster.
Am I doing something wrong?

tera · April 21, 2010, 1:47am

Coalescing accesses to Ad would be more important than coalescing reads of xd.
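
To illustrate the point: in both kernels each thread walks its own row of Ad, so at any given step the threads of a warp read addresses that are n floats apart, which is not coalesced. Below is a minimal sketch of one possible way to coalesce those reads, not a definitive implementation. It assumes A is stored row-major, the block size is a multiple of 32, and the toolkit provides __shfl_down_sync (CUDA 9 or later). The names mv_warp_per_row and max_abs_error_vs_cpu are hypothetical and not from the thread, and error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cmath>
#include <cassert>

// Sketch: one warp per row of row-major A. Lane k reads A[row*n + k],
// A[row*n + k + 32], ... so the 32 lanes of a warp load 32 consecutive
// floats at a time (coalesced). Assumes blockDim.x is a multiple of 32.
__global__ void mv_warp_per_row(float *y, const float *A, const float *x,
                                int m, int n)
{
    const int lane = threadIdx.x % 32;
    const int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (row >= m) return;  // all 32 lanes share the same row, so the warp exits together

    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)
        sum += A[(size_t)row * n + j] * x[j];

    // Add up the 32 partial sums within the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0) y[row] = sum;
}

// Hypothetical host helper: runs the kernel on small test data and returns
// the largest absolute difference from a plain CPU loop.
float max_abs_error_vs_cpu(int m, int n)
{
    std::vector<float> A((size_t)m * n), x(n), y(m);
    for (size_t i = 0; i < A.size(); ++i) A[i] = float(i % 7) * 0.25f;
    for (int j = 0; j < n; ++j) x[j] = float(j % 5) - 2.0f;

    float *dA, *dx, *dy;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, m * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;                    // 8 warps, so 8 rows per block
    const int rowsPerBlock = threads / 32;
    const int blocks = (m + rowsPerBlock - 1) / rowsPerBlock;
    mv_warp_per_row<<<blocks, threads>>>(dy, dA, dx, m, n);

    cudaMemcpy(y.data(), dy, m * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);

    float maxErr = 0.0f;
    for (int i = 0; i < m; ++i) {
        float ref = 0.0f;
        for (int j = 0; j < n; ++j) ref += A[(size_t)i * n + j] * x[j];
        maxErr = std::fmax(maxErr, std::fabs(ref - y[i]));
    }
    return maxErr;
}
```

Calling max_abs_error_vs_cpu(m, n) from main with a few sizes is a quick sanity check. Whether this layout actually beats the one-thread-per-row kernel depends on m, n, and the hardware, so it is worth timing both.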

Exitios · April 21, 2010, 3:01am

Kindly stop creating duplicate threads, particularly when your concern has already been addressed by another member. Please continue to use this thread for your issue.
