Hello fellow CUDA users,
I am convolving a double array (Md) with a kernel (Nd) of size KSZ = 23, and I am experimenting with and without the use of shared memory.
In the case without shared memory, I do the computation directly from global memory (Md and Nd). In the shared memory case, I first load data from Md into PS (shared memory) and then do the computation using PS.
CUDA kernel snippet without shared memory:
double sum1 = 0;
if (tx < 10 && ty < 10)                // only the first 10x10 threads produce output
{
    for (int i = 0; i < KSZ; i++)
        for (int j = 0; j < KSZ; j++)
            sum1 += Nd[i*KSZ+j] * Md[(Row+i)*cols + (Col+j)];
    R[Row*NC + Col] = sum1;
}
and the CUDA kernel snippet with shared memory:
double sum1 = 0;
if (tx < 10 && ty < 10)                // only the first 10x10 threads produce output
{
    for (int i = 0; i < KSZ; i++)
        for (int j = 0; j < KSZ; j++)
            sum1 += Nd[i*KSZ+j] * PS[tx+i][ty+j];
    R[Row*NC + Col] = sum1;
}
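The load of PS is not shown above; it amounts to one element per thread over the (10 + KSZ - 1) = 32 wide input tile. A minimal sketch of such a load, assuming a 32x32 thread block, KSZ known at compile time, and no boundary handling (as in the snippets above):
__shared__ double PS[10+KSZ-1][10+KSZ-1]; // 32x32 doubles = 8 KB of shared memory per block
PS[tx][ty] = Md[Row*cols + Col];          // laid out to match the PS[tx+i][ty+j] access above
__syncthreads();                          // tile must be complete before the loops run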
For both the above code snippets, I am using the following assignments:
int tx = threadIdx.x;
int ty = threadIdx.y;
int bx = blockIdx.x;
int by = blockIdx.y;
int Row = 10*by + ty;
int Col = 10*bx + tx;
Results:
Runtime without shared memory = 6770 u-sec
Runtime with shared memory = 16240 u-sec
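(The times are per kernel launch. A minimal sketch of the measurement, assuming CUDA event timing; convKernel, grid, and block below are placeholder names, not my exact code:)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
convKernel<<<grid, block>>>(Nd, Md, R);   // placeholder launch of either version
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("runtime = %.0f u-sec\n", ms * 1000.0f);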
So I am worried about the runtime being higher in the shared memory case. One reason I can think of is that I might be losing parallelism in terms of the number of blocks assigned to each SM if each block uses a lot of shared memory.
But I am not sure, and there might be other reasons, so I need help on this.
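One way to check the occupancy side of this, assuming the runtime occupancy API is available (convKernel is again a placeholder for the shared-memory kernel):
int numBlocks = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks,
    convKernel,   // placeholder for the shared-memory kernel
    32 * 32,      // threads per block, assuming a 32x32 block
    0);           // dynamic shared memory: none, PS is statically declared
printf("resident blocks per SM: %d\n", numBlocks);
Under the 32x32-block assumption, each block needs 32*32*8 bytes = 8 KB of shared memory and 1024 threads, so either limit could be capping the number of resident blocks per SM.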
Thank you in advance
- Divakar