Shared memory problem

RyanV1 · January 29, 2008, 4:48pm

Can anyone see any obvious memory coalescing and/or bank conflict problems with this kernel? (Block size is 16x16.)

__global__ void conv_mean_tex_sharedmem_f(float* d_data, int d_pitch, int width, int height)
{
//x y address within the image corresponding to the thread
const int ix = IMUL(blockDim.x-2, blockIdx.x) + threadIdx.x - 1;
const int iy = IMUL(blockDim.y-2, blockIdx.y) + threadIdx.y - 1;
//texture coord into the image
const float itx = (float)ix + 0.5f;
const float ity = (float)iy + 0.5f;

//shared memory for the entire image block and the apron
__shared__ float shared_data[BLOCK_SIZE * BLOCK_SIZE];

//just for clarity
const int spitch = blockDim.x;
const int sx = threadIdx.x;
const int sy = threadIdx.y;

//load into shared memory
shared_data[IDX(spitch,sx,sy)] = tex2D(texData,itx,ity);

__syncthreads();

if((threadIdx.x == 0) || (threadIdx.x == blockDim.x-1) || 
   (threadIdx.y == 0) || (threadIdx.y == blockDim.y-1) ||
   (ix >= width) || (iy >= height))
   return;

d_data[d_pitch*iy+ix] = (shared_data[IDX(spitch,sx-1,sy-1)] +
                         shared_data[IDX(spitch,sx  ,sy-1)] +
                         shared_data[IDX(spitch,sx+1,sy-1)] +
                         shared_data[IDX(spitch,sx-1,sy  )] +
                         shared_data[IDX(spitch,sx  ,sy  )] +
                         shared_data[IDX(spitch,sx+1,sy  )] +
                         shared_data[IDX(spitch,sx-1,sy+1)] +
                         shared_data[IDX(spitch,sx  ,sy+1)] +
                         shared_data[IDX(spitch,sx+1,sy+1)])/9.0f;

}

RyanV1 · January 29, 2008, 6:16pm

The reason I ask is that I get faster results without shared memory, relying only on the texture cache:

__global__ void conv_mean_tex_f(float* d_data, int d_pitch, int width, int height)
{
const int ix = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
const int iy = IMUL(blockDim.y, blockIdx.y) + threadIdx.y;
const float x = (float)ix + 0.5f;
const float y = (float)iy + 0.5f;

if(ix >= width || iy >= height)
    return;

//3x3 mean: nine taps, including the (x+1,y) tap
d_data[d_pitch*iy+ix] = (tex2D(texData,x-1,y-1) +
                         tex2D(texData,x  ,y-1) +
                         tex2D(texData,x+1,y-1) +
                         tex2D(texData,x-1,y  ) +
                         tex2D(texData,x  ,y  ) +
                         tex2D(texData,x+1,y  ) +
                         tex2D(texData,x-1,y+1) +
                         tex2D(texData,x  ,y+1) +
                         tex2D(texData,x+1,y+1))/9.0f;

}

sicb0161 · February 8, 2008, 4:34pm

Maybe it is because device memory reads through texture fetches are cached, as described in section 5.4. You might not need shared memory to cache your data?

paulius · February 8, 2008, 7:48pm

What are the performance numbers?
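One way to collect such numbers is with CUDA events. The following is only a minimal sketch: `scale_kernel` and `time_kernel_ms` are hypothetical names, and the placeholder kernel stands in for whichever convolution kernel is being measured. It does one untimed warm-up launch and then averages over several timed launches.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); substitute the kernel being measured.
__global__ void scale_kernel(float* data, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the average time per launch in milliseconds over `reps` launches,
// or -1.0f if a CUDA call fails.
float time_kernel_ms(int n, int reps)
{
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, not timed.
    scale_kernel<<<blocks, threads>>>(d_data, n);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        scale_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    const cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return (err == cudaSuccess) ? ms / reps : -1.0f;
}
```

Calling `time_kernel_ms(1 << 20, 100)` once per kernel variant gives per-launch times that can be compared directly, provided both variants process the same image.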

Try using a 2D smem array. The code will be easier to read, and the compiler does a good job with the indexing.
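A rough sketch of what the 2D shared-memory version might look like follows. It is not the original poster's code: the names `conv_mean_smem2d`, `fetch_clamped`, and `max_error_smem2d` are made up for illustration, `BLOCK_SIZE` is assumed to be 16 as stated in the first post, and a clamped read from a plain global pointer stands in for the `tex2D` fetch because the thread does not show the texture setup.

```cuda
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed from the thread: 16x16 blocks

// Clamped global-memory read; a stand-in for tex2D with clamp addressing.
__device__ float fetch_clamped(const float* src, int pitch, int w, int h, int x, int y)
{
    x = min(max(x, 0), w - 1);
    y = min(max(y, 0), h - 1);
    return src[y * pitch + x];
}

// Each 16x16 block loads a tile with a one-pixel apron and writes the inner 14x14.
__global__ void conv_mean_smem2d(const float* src, float* dst, int pitch, int width, int height)
{
    const int sx = threadIdx.x, sy = threadIdx.y;
    const int ix = (BLOCK_SIZE - 2) * blockIdx.x + sx - 1;
    const int iy = (BLOCK_SIZE - 2) * blockIdx.y + sy - 1;

    __shared__ float tile[BLOCK_SIZE][BLOCK_SIZE];
    tile[sy][sx] = fetch_clamped(src, pitch, width, height, ix, iy);
    __syncthreads();

    // Apron threads and threads past the image edge only load; they do not write.
    if (sx == 0 || sx == BLOCK_SIZE - 1 || sy == 0 || sy == BLOCK_SIZE - 1 ||
        ix >= width || iy >= height)
        return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tile[sy + dy][sx + dx];
    dst[pitch * iy + ix] = sum / 9.0f;
}

// Hypothetical check: runs the kernel on a w x h test image and returns the
// maximum absolute difference from a CPU reference (INFINITY on CUDA failure).
float max_error_smem2d(int w, int h)
{
    std::vector<float> in(w * h), out(w * h, 0.0f);
    for (int i = 0; i < w * h; ++i) in[i] = (float)(i % 17);

    float *d_in = nullptr, *d_out = nullptr;
    const size_t bytes = (size_t)w * h * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return INFINITY;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return INFINITY; }
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    const dim3 grid((w + BLOCK_SIZE - 3) / (BLOCK_SIZE - 2),
                    (h + BLOCK_SIZE - 3) / (BLOCK_SIZE - 2));
    conv_mean_smem2d<<<grid, block>>>(d_in, d_out, w, w, h);
    const cudaError_t err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return INFINITY;

    float max_err = 0.0f;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int cx = std::min(std::max(x + dx, 0), w - 1);
                    const int cy = std::min(std::max(y + dy, 0), h - 1);
                    sum += in[cy * w + cx];
                }
            max_err = std::max(max_err, std::fabs(out[y * w + x] - sum / 9.0f));
        }
    return max_err;
}
```

With `tile[sy][sx]`, the threads of a half-warp (one row of the 16x16 block) touch 16 consecutive floats, so the nine shared-memory reads should not introduce bank conflicts on their own. Note that the kernel indexes with `BLOCK_SIZE` rather than `blockDim`, so it must be launched with 16x16 blocks.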

Also, note that your boundary condition is more complicated in the non-texture (shared-memory) code. If the time difference is small, check whether this accounts for the additional cost.

Paulius
