when to use shared memory

Hello, does using shared memory always perform better? I am trying to do a convolution calculation on a 3D volume along the x, y, and z directions. The 3D volume is stored as a linear memory buffer, ordered row first, column second, and slice third. I use the kernels below with the following block/thread configuration: a 2D grid (256 by 256 blocks) and a 1D block (256 threads), so each thread accesses one voxel of the 256×256×256 volume. To speed up the convolution I load the current row or column into shared memory (I am doing a separable Gaussian filter here). With shared memory, the first kernel gaussianFilterX gets a speedup, but the second kernel gaussianFilterY actually slows down considerably. The two kernels are identical except for the mapping of blockIdx and threadIdx to the x, y, z coordinates in the volume.
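In case it matters, this is roughly how I launch the two kernels from the host; the names d_a, d_b, d_kernel and the half-width value are just illustrative of my setup:

// d_a, d_b, d_kernel are assumed already allocated with cudaMalloc and filled
dim3 grid(256, 256);                       // blockIdx.x / blockIdx.y cover two axes of the volume
dim3 block(256);                           // threadIdx.x covers the remaining axis
size_t sharedBytes = 256 * sizeof(float);  // one row or column of the volume per block

// timing each pass separately; d_kernel holds the Gaussian taps
gaussianFilterX<<<grid, block, sharedBytes>>>(d_b, d_a, d_kernel, halfKernelWidth);
gaussianFilterY<<<grid, block, sharedBytes>>>(d_b, d_a, d_kernel, halfKernelWidth);
cudaDeviceSynchronize();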

Any help? I am not very familiar with CUDA yet.

Thanks in advance!

__global__ void gaussianFilterX( float* d_b, float* d_a, float* kernel, int halfKernelWidth)
// this one gets sped up by shared memory
{
    int x, y, z;

    z = blockIdx.y;
    y = blockIdx.x;
    x = threadIdx.x;   // each thread handles one voxel along the x (row) direction

    int imageSize = 256*256;
    int lineSize = 256;

    int xx;
    float sum = 0.0f;

    // stage the current row in shared memory (float, to match the volume data)
    extern __shared__ float s_data[];
    s_data[threadIdx.x] = d_a[z*imageSize + y*lineSize + x];
    __syncthreads();

    for (int i = -halfKernelWidth; i <= halfKernelWidth; i++)
    {
        // wrap around the 256-wide line (the bitmask works because 256 is a power of two)
        xx = (threadIdx.x + i + 256) & 255;

        sum += s_data[xx] * kernel[i+halfKernelWidth];
    }

    d_b[z*imageSize + y*lineSize + x] = sum;
}

__global__ void gaussianFilterY( float* d_b, float* d_a, float* kernel, int halfKernelWidth)
// this one is slowed down by shared memory
{
    int x, y, z;

    z = blockIdx.y;
    x = blockIdx.x;
    y = threadIdx.x;   // each thread handles one voxel along the y (column) direction

    int imageSize = 256*256;
    int lineSize = 256;

    int xx;
    float sum = 0.0f;

    // stage the current column in shared memory (float, to match the volume data)
    extern __shared__ float s_data[];
    s_data[threadIdx.x] = d_a[z*imageSize + y*lineSize + x];
    __syncthreads();

    for (int i = -halfKernelWidth; i <= halfKernelWidth; i++)
    {
        // wrap around the 256-wide column (the bitmask works because 256 is a power of two)
        xx = (threadIdx.x + i + 256) & 255;

        sum += s_data[xx] * kernel[i+halfKernelWidth];
    }

    d_b[z*imageSize + y*lineSize + x] = sum;
}