Slow with Shared Memory on Fermi for convolution computation

Hello fellow CUDA users,

I am doing convolution on a double array (Md) with a kernel (Nd) of size KSZ = 23. For this I am experimenting with and without the use of shared memory.
In the case without shared memory I do the computation directly from global memory (Md and Nd). In the shared memory case I first load data from Md into PS (shared memory) and then do the computation using PS.

CUDA kernel code snippet without shared memory:

double sum1=0;
if(tx<10 && ty<10)
{
    for(int i=0;i<KSZ;i++)
        for(int j=0;j<KSZ;j++)
            sum1 += Nd[i*KSZ+j]*Md[(Row+i)*cols+(Col+j)];
    R[Row*NC+Col] = sum1;
}

and the CUDA kernel code snippet with shared memory:

double sum1=0;
if(tx<10 && ty<10)
{
    for(int i=0;i<KSZ;i++)
        for(int j=0;j<KSZ;j++)
            sum1 += Nd[i*KSZ+j]*PS[tx+i][ty+j];
    R[Row*NC+Col] = sum1;
}

For both the above code snippets, I am using the following assignments:

int tx = threadIdx.x;
int ty = threadIdx.y;

int bx = blockIdx.x;
int by = blockIdx.y;

int Row = 10*by+ty;
int Col = 10*bx+tx;
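
The load of Md into PS is not shown above. A minimal sketch of how it might look, assuming PS is a 32x32 tile of doubles (the 10x10 output tile plus the KSZ-1 = 22 element halo), a block of at least 32x32 threads, and no image-boundary handling; these are assumptions, not taken from the original code, and the [row][column] order of PS would have to be made consistent with the PS[tx+i][ty+j] access in the compute loop above:

__shared__ double PS[32][32];   // 10x10 output tile + 22-element halo (assumed size)

// each thread copies one element of the block's input footprint into shared memory
PS[tx][ty] = Md[(10*by + ty)*cols + (10*bx + tx)];
__syncthreads();                // the whole tile must be loaded before any thread reads PS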

Results:
Without shared memory: runtime of the snippet = 6770 us
With shared memory: runtime of the snippet = 16240 us
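
For reference, timings at this granularity are usually taken with CUDA events around the kernel launch; a minimal sketch (the kernel name, launch configuration and arguments are placeholders, not from the original code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
convKernel<<<grid, block>>>(Md, Nd, R);   // placeholder launch
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time = %.0f us\n", ms * 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);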

So I am worried that the runtime is higher in the shared memory case. One reason I can think of is that I might be losing parallelism in terms of the number of blocks assigned to each SM, if I am using a lot of shared memory per block.
But I am not sure, and there might be other reasons, so I would appreciate some help with this.
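
For a rough sense of scale, assuming PS is 32x32 doubles as sketched above (the size is an assumption, it is not given in the post):

const int PS_DIM = 10 + 23 - 1;                          // = 32
size_t smemPerBlock = PS_DIM * PS_DIM * sizeof(double);  // = 8192 bytes (8 KB)
// A Fermi SM has 48 KB (or 16 KB, depending on the L1/shared split) of shared
// memory, so this alone would still allow up to 6 (or 2) resident blocks per SM;
// the profiler can confirm what the actual occupancy limiter is.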

Thank you in advance

  • Divakar

If you are curious if this is related to the number of blocks per SM, you should use the profiler (command line or Visual) to see what the occupancy is. If you have a compute capability 2.x device, then it is possible the L1 and L2 cache are sufficient to hold the convolution kernel, eliminating the need for shared memory.
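
As a quick complement to the profiler, the kernel's per-block resource usage (shared memory and registers, which together with block size determine occupancy) can also be queried at runtime. A minimal sketch, where convKernel is a placeholder for the actual kernel name:

// convKernel is a placeholder for the actual convolution kernel
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, convKernel);
printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
printf("registers per thread:           %d\n", attr.numRegs);
printf("max threads per block:          %d\n", attr.maxThreadsPerBlock);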

Thank you Seibert for replying. I need to get the hang of using the Visual Profiler. I have replaced the kernel Nd with constants like 1.0 and the runtime is still higher with shared memory. I read on the Dr. Dobb's site that fetching from shared memory is about as fast as from registers, which is why I was surprised by the results.

  • Divakar