Critique of a small convolution kernel

We’ve written a small image convolution as part of a larger program. Below is the kernel for the x-direction; it is launched with one block per image row and one thread per column. NUM_IMAGES, IMAGE_SIZE, IMAGE_WIDTH, and IMAGE_HEIGHT are compile-time constants defined elsewhere in the program.

__global__ void gradientKernel(unsigned char* g_sequence, short int* g_Gx)
{
    int MID = (NUM_IMAGES - 1) / 2;  // index of the middle image in the sequence
    int I   = threadIdx.x;           // column of the pixel this thread handles
    int B   = blockIdx.x;            // row of the image this block handles
    short int resultX = 0;

    // Read one row of the middle image into shared memory.
    __shared__ short int sequence[IMAGE_WIDTH];
    sequence[I] = g_sequence[MID*IMAGE_SIZE + B*IMAGE_WIDTH + I];

    // All threads must finish loading before any thread reads its
    // neighbours, so that sequence[] is fully populated.
    __syncthreads();

    // Calculate the gradient in the x direction with the stencil
    // [-1, 8, 0, -8, 1]. There is a two-pixel-wide border around the
    // image where the stencil doesn't fit; those pixels stay zero.
    if (!(I < 2 || I >= IMAGE_WIDTH - 2) && !(B < 2 || B >= IMAGE_HEIGHT - 2))
    {
        resultX = -sequence[I-2] + 8*sequence[I-1]
                  - 8*sequence[I+1] + sequence[I+2];
    }

    g_Gx[B*IMAGE_WIDTH + I] = resultX;
}
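
In case it helps, here is a minimal sketch of how the kernel is launched, matching the one-block-per-row, one-thread-per-column mapping described above (d_sequence and d_Gx are placeholder names for our device buffers; the host-to-device copy and error checking are trimmed):

unsigned char* d_sequence;   // full image sequence on the device
short int*     d_Gx;         // one gradient image on the device
cudaMalloc((void**)&d_sequence, NUM_IMAGES * IMAGE_SIZE);
cudaMalloc((void**)&d_Gx, IMAGE_HEIGHT * IMAGE_WIDTH * sizeof(short int));
// ... cudaMemcpy the image sequence into d_sequence ...

// One block per image row, one thread per column. Note that IMAGE_WIDTH
// must not exceed the device's maximum threads per block for this launch.
gradientKernel<<<IMAGE_HEIGHT, IMAGE_WIDTH>>>(d_sequence, d_Gx);
cudaDeviceSynchronize();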

We don’t have much experience with image processing, GPUs, or parallel programming, so we’d appreciate any comments or suggestions from you experts on how to improve its performance. Thanks!