Why is this kernel slow?

I am doing a simple separable convolution with a filter of length 5 on a large 1K by 1K pixel image. I am using a GeForce 9500GT.

I am not doing it the same way as the separable convolution example that comes with the CUDA SDK; because of the way this kernel is used, and because the image size varies, it would be difficult to change the code to match the SDK implementation.

I tested removing the if statements to see if they were the reason the kernel is slow, but they only seemed to slow it down slightly. I included them to do reflection around the image edges.

I also do pretty much the exact same thing for the column convolution. I'm wondering if the column accesses are slow because they go down a column instead of across a row. Should I use a texture for the column kernels in order to speed up the vertical accesses?

I am calling these functions with the number of blocks equal to the number of columns/rows, and the number of threads per block equal to 256.

Please let me know if there is some way to improve the performance of this kernel without drastically changing things or relying on a bunch of constants. Thanks.

__global__ void convolveAndDownsample( int width, int smallPitch,
                                       float* d_filt, float* inData, float* outData )
{
  __shared__ float imgLine[1392];
  __shared__ float s_filt[5];

  // Load the 5-tap filter into shared memory.
  if ( threadIdx.x < 5 ) {
    s_filt[threadIdx.x] = d_filt[threadIdx.x];
  }

  // Each block processes one image row.
  inData  += blockIdx.x * width;
  outData += blockIdx.x * smallPitch;

  // Stage the row in shared memory, two pixels per thread.
  int x = threadIdx.x * 2;
  #pragma unroll
  for ( ; x < width; x += blockDim.x * 2 ) {
    imgLine[x]   = inData[x];
    imgLine[x+1] = inData[x+1];
  }

  __syncthreads();

  // Convolve and write every other pixel (downsample by 2).
  // ABS is assumed to be defined elsewhere (an absolute-value macro).
  for ( x = threadIdx.x * 2; x < width - 2; x += blockDim.x * 2 ) {
    float resultPixel = 0;
    #pragma unroll
    for ( int i = 0; i < 5; ++i ) {
      int current = ABS( x - ( i - 2 ) );   // reflect at the left edge
      if ( current >= width ) {             // reflect at the right edge
        current = width - 1 - ( current - ( width - 1 ) );
      }
      resultPixel += s_filt[i] * imgLine[current];
    }
    outData[x/2] = resultPixel;
  }
}

Hi,

My feeling is that the problem is the read/write latency of copying data from/to the global arrays, combined with the number of blocks per multiprocessor.

float imgLine[1392] means this array takes 5568 bytes, over a third of the 16 KB of shared memory, so only 2 blocks can be resident on a multiprocessor at a time.

3 is a better number for hiding read/write latency.

If you can, try reducing it to, say, 1024. Then

for ( ; x < width; x += blockDim.x * 2 ) {
	imgLine[x]   = inData[x];
	imgLine[x+1] = inData[x+1];
}

becomes (for, say, a width of 1024 and 256 threads per block)

imgLine[threadIdx.x*2]       = inData[threadIdx.x*2];
imgLine[1 + threadIdx.x*2]   = inData[1 + threadIdx.x*2];
imgLine[512 + threadIdx.x*2] = inData[512 + threadIdx.x*2];
imgLine[513 + threadIdx.x*2] = inData[513 + threadIdx.x*2];

Have you tried doing it with just

int x = threadIdx.x;
for ( ; x < width; x += blockDim.x ) {
	imgLine[x] = inData[x];
}

That also makes the shared-memory loads coalesced, since consecutive threads read consecutive elements.

The next loop (the one after the __syncthreads()) I think usually only runs twice per thread (for a 1024-wide image).

Maybe it would be better to run it only once per block and vary the number of blocks with the image size.

That would also let you change the previous loop, the size of imgLine, and the block size.

The problem gets worse when you go to the column convolution, because then all the reads and writes are strided (slower). For the column convolution it would be better if each half warp read adjacent cells from inData, so instead of one block processing 256 columns, think about each block processing a 16*16 tile of the image.


The 9500GT is a compute capability 1.1 device, so uncoalesced accesses are converted into 32 transactions of 32 bytes each.

This code also has divergence in the first if.

The most important issue is the uncoalesced accesses. The accesses to inData at [x] and [x+1] appear to be uncoalesced, and that carries a heavy penalty on CC 1.1 hardware.
