Global vs shared memory grayscale: shared memory version is no faster

Hi guys,

I measured the performance of my grayscale filter, which converts an input RGB image to grayscale, using both shared and global memory. Surprisingly, the shared memory code took more time than the global memory code. Below is the pseudocode for both filters. Can anyone point out where I am going wrong? Or is using shared memory for grayscale simply a bad decision?

The pseudo code with shared memory:
__shared__ unsigned char sh_Tile_in[16 * 16 * 4];
int tx = threadIdx.x + (blockIdx.x * blockDim.x);
int ty = threadIdx.y + (blockIdx.y * blockDim.y);
int offset = tx + ty * blockDim.x * gridDim.x;
int sh_offset = threadIdx.x + threadIdx.y * 16;
/* some copy stuff */
if (offset < width * height)
{
sh_Tile_in[sh_offset] = 0.3 * sh_Tile_in[sh_offset * 4 + 0] + 0.6 * sh_Tile_in[sh_offset * 4 + 1] + 0.1 * sh_Tile_in[sh_offset * 4 + 2];
}
__syncthreads();
gpu_in[offset] = sh_Tile_in[sh_offset];

The code with global memory:
if(offset < width * height)
{
color = 0.3 * gpu_in[offset * 4 + 0] + 0.6 * gpu_in[offset * 4 + 1] + 0.1 * gpu_in[offset * 4 + 2];
gpu_in_4[offset * 4 + 0] = color;
gpu_in_4[offset * 4 + 1] = color;
gpu_in_4[offset * 4 + 2] = color;
gpu_in_4[offset * 4 + 3] = 0;
}

Thanks in advance…

Why do you expect the shared memory version to be faster? The kernel does not reuse any data from shared memory, so the number of global memory reads is the same as in the other version, just with the added overhead of the copy into shared memory and the __syncthreads() call.
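
To illustrate the point, here is a minimal sketch of a global-memory-only kernel. Each pixel is read once and used once, so nothing is staged in shared memory. It assumes the input is tightly packed 4-byte RGBA and that the output is one byte per pixel; the names grayscale_kernel and run_grayscale are made up for this example and are not from the code above. The weights 0.3 / 0.6 / 0.1 are the ones from the question.

```cuda
#include <cuda_runtime.h>

// One thread per pixel: one 4-byte load, a few multiply-adds, one 1-byte store.
// No input byte is shared between threads, so there is nothing to reuse.
__global__ void grayscale_kernel(const uchar4* rgba, unsigned char* gray,
                                 int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard partial blocks at the edges

    int idx = y * width + x;
    uchar4 p = rgba[idx];  // neighbouring threads read neighbouring 32-bit words
    gray[idx] = (unsigned char)(0.3f * p.x + 0.6f * p.y + 0.1f * p.z);
}

// Hypothetical host helper: copies the image to the device, runs the kernel,
// and copies the result back. Returns 0 on success, 1 on any CUDA error.
int run_grayscale(const unsigned char* h_rgba, unsigned char* h_gray,
                  int width, int height)
{
    size_t n = (size_t)width * height;
    uchar4* d_rgba = nullptr;
    unsigned char* d_gray = nullptr;

    if (cudaMalloc(&d_rgba, n * sizeof(uchar4)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_gray, n) != cudaSuccess) { cudaFree(d_rgba); return 1; }

    cudaError_t err = cudaMemcpy(d_rgba, h_rgba, n * sizeof(uchar4),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        grayscale_kernel<<<grid, block>>>(d_rgba, d_gray, width, height);
        err = cudaGetLastError();                       // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_gray, d_gray, n, cudaMemcpyDeviceToHost);

    cudaFree(d_rgba);
    cudaFree(d_gray);
    return err == cudaSuccess ? 0 : 1;
}
```

Shared memory tends to pay off when several threads in a block need the same input element, for example in a blur or other stencil filter where neighbouring pixels overlap. A per-pixel conversion like this one has no such overlap.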