First attempt - non-separable image convolution

Hi,

This is my first attempt at writing a CUDA program. The purpose is to convolve an image with a non-separable filter. I'm enclosing the kernel, and I would really appreciate any comments you might have, especially on how to make it run even faster. Running the program under the Visual Profiler reveals a large number of gld_incoherent (uncoalesced) loads and an occupancy of only 0.667. How can I improve this?

const int border = filter.width(); // margin of one full filter width around the image (conservative; half would suffice)

// first and last pixel index to convolve in each dimension
const dim3 xIndex(border, image.width() - border - 2);
const dim3 yIndex(border, image.height() - border - 2);

// size of the region that is actually convolved
const dim3 convArea(xIndex.y - xIndex.x + 1, yIndex.y - yIndex.x + 1);

// 289 x 225 pixels -> 17 x 15 blocks of 17 x 15 threads each
const dim3 dimGrid(17, 15);
const dim3 dimBlock(convArea.x / dimGrid.x, convArea.y / dimGrid.y);

convolve<<<dimGrid, dimBlock>>>(gpuImage, gpuFilter, gpuResult, imageDim, filterDim);

__global__ void convolve(int* image, int* filter, int* result, uint2 imageDim, uint2 filterDim) {
	// filter radius; assumes a square filter with an odd side length
	const int dim = (filterDim.x - 1) / 2;

	// pixel this thread computes, offset by one full filter width to stay inside the border
	const int x = blockIdx.x * blockDim.x + filterDim.x + threadIdx.x;
	const int y = blockIdx.y * blockDim.y + filterDim.x + threadIdx.y;

	int sum = 0;
	for (int i = -dim; i <= dim; ++i) {
		for (int j = -dim; j <= dim; ++j) {
			// i indexes rows, j indexes columns, for both filter and image
			const int filterX = j + dim;
			const int filterY = i + dim;
			const int imageX = j + x;
			const int imageY = i + y;

			const int filterVal = filter[filterY * filterDim.x + filterX];
			const int pixelVal = image[imageY * imageDim.x + imageX];
			sum += filterVal * pixelVal;
		}
	}

	result[y * imageDim.x + x] = sum;
}

Example of program output:

Image dimensions: (320, 256)

Filter dimensions: (15, 15)

Convolution area: (289, 225), [15;303][15;239]

Number of pixels to calculate: 65025

Number of integer operations: 29261250

Grid dimensions: 17 * 15 blocks

Block dimensions: 17 * 15 threads

CPU execution time: 255.851003 ms

GPU execution time: 22.713926 ms

GPU execution time: 22.690844 ms

GPU execution time: 22.784740 ms

Thanks in advance!

Andreas

This looks like a place where using a texture for filter and image would improve performance significantly. Texture reads are cached, and are good for reducing uncoalesced reads when your read pattern is not linear, but still has spatial locality. Check out the simpleTexture example in the SDK. It will show you how to allocate image and filter in the special array format, bind them to texture references, and then access them from your device code.
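For reference, a minimal sketch of that setup using the texture reference API (texImage, d_imageArray, h_image, width and height are placeholder names, not from your code):

// 2D texture reference for the image, declared at file scope
texture<int, 2, cudaReadModeElementType> texImage;

// allocate a cudaArray and copy the image into it
cudaChannelFormatDesc desc = cudaCreateChannelDesc<int>();
cudaArray* d_imageArray;
cudaMallocArray(&d_imageArray, &desc, width, height);
cudaMemcpyToArray(d_imageArray, 0, 0, h_image, width * height * sizeof(int), cudaMemcpyHostToDevice);

// bind the array to the texture reference
cudaBindTextureToArray(texImage, d_imageArray, desc);

In the kernel, the load image[imageY * imageDim.x + imageX] then becomes tex2D(texImage, imageX, imageY), which goes through the texture cache instead of issuing an uncoalesced global load.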

Hi Seibert.

Thanks for your reply. I have tried two optimizations now. The first is to load the filter into constant ("symbol") memory, and the second is to use textures for the filter and image, as you suggested. Here are the results:

Image dimensions: (320, 256)

Filter dimensions: (15, 15)

Convolution area: (289, 225), [15;303][15;239]

Number of pixels to calculate: 65025

Number of integer operations: 29261250

Grid dimensions: 17 * 15 blocks

Block dimensions: 17 * 15 threads

CPU execution time: 258.864685 ms

GPU execution time: 9.637045 ms

GPU execution time: 9.647624 ms

GPU execution time: 9.480155 ms

GPU and CPU results match.

GPU texture execution time: 0.051067 ms

GPU texture execution time: 0.042767 ms

GPU texture execution time: 0.042779 ms

GPU and CPU results match.

But now I'm amazed by the 5000x speed-up. I'm using QueryPerformanceCounter on Windows to time the CPU (Intel Core 2 Duo) and cutCreateTimer to time the GPU. Are these numbers really realistic?

Thanks in advance,

Andreas

Make sure you have a CUDA_SAFE_CALL(cudaThreadSynchronize()); after your kernel call, before you stop the timer. Kernel launches are asynchronous, so without the synchronization you only measure the launch overhead rather than the kernel itself, which would explain the suspiciously low texture timings.
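For example (a sketch using the cutil timers from the SDK's cutil.h; dimGrid, dimBlock and the kernel arguments are taken from your code above):

unsigned int timer = 0;
cutCreateTimer(&timer);

cutStartTimer(timer);
convolve<<<dimGrid, dimBlock>>>(gpuImage, gpuFilter, gpuResult, imageDim, filterDim);
// the launch returns immediately; block until the kernel has actually finished
CUDA_SAFE_CALL(cudaThreadSynchronize());
cutStopTimer(timer);

printf("GPU execution time: %f ms\n", cutGetTimerValue(timer));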

Ah, good idea about the symbol memory for filter. Symbol (or “constant”) memory is best for broadcast reads, which you have since all threads in a warp read the same filter element at the same time. You might find that using constant memory for the filter and texture memory for the image is the overall best option.
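A sketch of the constant-memory side, assuming a 15x15 int filter as in your output (c_filter, MAX_FILTER_SIZE and h_filter are placeholder names):

// filter coefficients in constant memory, declared at file scope
#define MAX_FILTER_SIZE (15 * 15)
__constant__ int c_filter[MAX_FILTER_SIZE];

// host side: copy the coefficients up once before launching the kernel
cudaMemcpyToSymbol(c_filter, h_filter, filterDim.x * filterDim.y * sizeof(int));

In the kernel, filter[filterY * filterDim.x + filterX] becomes c_filter[filterY * filterDim.x + filterX], and since all threads in a warp read the same element in the same iteration, the constant cache serves it as a single broadcast.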