Why is this kernel slow?

I am doing a simple separable convolution with a filter of length 5 on a large 1K by 1K pixel image. I am using a GeForce 9500GT.

I am not doing it the same way as the separable convolution example that comes with the CUDA SDK; because of the way this kernel is used, and because the image size varies, it would be difficult to change the code to match the SDK implementation.

I tested removing the if statements to see if they were the reason the kernel is slow, but they only seemed to slow it down slightly. I included them to do reflection around the image edges.

I also do pretty much the exact same thing for the column convolution. I'm wondering if the column accesses are slow because they go down a column instead of across a row. Should I use a texture for the column kernels in order to speed up the vertical accesses?

I am calling these functions with the number of blocks equal to the number of columns/rows, and the number of threads per block equal to 256.

Please let me know if there is some way to improve the performance of this kernel without drastically changing things or relying on a bunch of constants. Thanks.

__global__ void convolveAndDownsample( int width, int smallPitch,
                                       float* d_filt, float* inData, float* outData )
{
  __shared__ float imgLine[1392];
  __shared__ float s_filt[5];

  // Load the 5-tap filter into shared memory.
  if ( threadIdx.x < 5 ) {
    s_filt[threadIdx.x] = d_filt[threadIdx.x];
  }

  // Each block processes one image row.
  inData  += blockIdx.x * width;
  outData += blockIdx.x * smallPitch;

  // Stage the row in shared memory, two pixels per thread.
  int x = threadIdx.x * 2;
  #pragma unroll
  for ( ; x < width; x += blockDim.x * 2 ) {
    imgLine[x]   = inData[x];
    imgLine[x+1] = inData[x+1];
  }

  __syncthreads();

  // Convolve and write every other pixel (downsample by 2).
  // ABS is assumed to be defined elsewhere (an absolute-value macro).
  for ( x = threadIdx.x * 2; x < width - 2; x += blockDim.x * 2 ) {
    float resultPixel = 0;
    #pragma unroll
    for ( int i = 0; i < 5; ++i ) {
      int current = ABS( x - ( i - 2 ) );   // reflect at the left edge
      if ( current >= width ) {             // reflect at the right edge
        current = width - 1 - ( current - ( width - 1 ) );
      }
      resultPixel += s_filt[i] * imgLine[current];
    }
    outData[x/2] = resultPixel;
  }
}

Hi,

My feeling is that the problem is the read/write latency of copying data from/to the global arrays, combined with the number of blocks per multiprocessor.

float imgLine[1392] means this array takes 5568 bytes, over a third of the 16 KB of shared memory, so only 2 blocks can be resident on a multiprocessor at a time.

3 is a better number for hiding read/write latency.

If you can, try reducing it to, say, 1024. Then

for ( ; x < width; x += blockDim.x * 2 ) {
	imgLine[x]   = inData[x];
	imgLine[x+1] = inData[x+1];
}

becomes (for, say, a width of 1024 and 256 threads per block)

imgLine[threadIdx.x*2]       = inData[threadIdx.x*2];
imgLine[1 + threadIdx.x*2]   = inData[1 + threadIdx.x*2];
imgLine[512 + threadIdx.x*2] = inData[512 + threadIdx.x*2];
imgLine[513 + threadIdx.x*2] = inData[513 + threadIdx.x*2];

Have you tried doing it with just

int x = threadIdx.x;
for ( ; x < width; x += blockDim.x ) {
	imgLine[x] = inData[x];
}

That also makes the shared-memory loads coalesced, since consecutive threads read consecutive elements.

The next loop (the one after the __syncthreads()) I think usually only runs twice per thread (for a 1024-wide image).

Maybe it would be better to run it only once per block and vary the number of blocks with the image size.

That would also let you change the previous loop, the size of imgLine, and the block size.

The problem gets worse when you go to the column convolution, because then all the reads and writes are strided (slower). For the column convolution it would be better if each half warp read adjacent cells from inData, so instead of one block processing 256 columns, think about each block processing a 16*16 tile of the image.


The 9500GT is a compute capability 1.1 device, so uncoalesced accesses are converted into 32 transactions of 32 bytes each.

This code also has divergence in the first if.

The most important issue is the uncoalesced accesses. The accesses to inData at [x] and [x+1] appear to be uncoalesced, and that carries a heavy penalty on CC 1.1 hardware.
