First attempt - non-separable image convolution

Hi,

This is my first attempt at writing a CUDA program. The purpose is to convolve an image with a non-separable filter. I'm enclosing the kernel, and I would really appreciate any comments you might have, especially on how to make it run even faster. Running the program under the Visual Profiler reveals a large number of gld_incoherent (uncoalesced) loads and an occupancy of only 0.667. How can I improve this?

const int border = filter.width(); // margin of one full filter width around the image (conservative; half would suffice)

// first and last pixel index to convolve in each dimension
const dim3 xIndex(border, image.width() - border - 2);
const dim3 yIndex(border, image.height() - border - 2);

// size of the region that is actually convolved
const dim3 convArea(xIndex.y - xIndex.x + 1, yIndex.y - yIndex.x + 1);

// 289 x 225 pixels -> 17 x 15 blocks of 17 x 15 threads each
const dim3 dimGrid(17, 15);
const dim3 dimBlock(convArea.x / dimGrid.x, convArea.y / dimGrid.y);

convolve<<<dimGrid, dimBlock>>>(gpuImage, gpuFilter, gpuResult, imageDim, filterDim);

__global__ void convolve(int* image, int* filter, int* result, uint2 imageDim, uint2 filterDim) {
	// filter radius; assumes a square filter with an odd side length
	const int dim = (filterDim.x - 1) / 2;

	// pixel this thread computes, offset by one full filter width to stay inside the border
	const int x = blockIdx.x * blockDim.x + filterDim.x + threadIdx.x;
	const int y = blockIdx.y * blockDim.y + filterDim.x + threadIdx.y;

	int sum = 0;
	for (int i = -dim; i <= dim; ++i) {
		for (int j = -dim; j <= dim; ++j) {
			// i indexes rows, j indexes columns, for both filter and image
			const int filterX = j + dim;
			const int filterY = i + dim;
			const int imageX = j + x;
			const int imageY = i + y;

			const int filterVal = filter[filterY * filterDim.x + filterX];
			const int pixelVal = image[imageY * imageDim.x + imageX];
			sum += filterVal * pixelVal;
		}
	}

	result[y * imageDim.x + x] = sum;
}

Example of program output:

Image dimensions: (320, 256)

Filter dimensions: (15, 15)

Convolution area: (289, 225), [15;303][15;239]

Number of pixels to calculate: 65025

Number of integer operations: 29261250

Grid dimensions: 17 * 15 blocks

Block dimensions: 17 * 15 threads

CPU execution time: 255.851003 ms

GPU execution time: 22.713926 ms

GPU execution time: 22.690844 ms

GPU execution time: 22.784740 ms

Thanks in advance!

Andreas

This looks like a place where using a texture for filter and image would improve performance significantly. Texture reads are cached, and are good for reducing uncoalesced reads when your read pattern is not linear, but still has spatial locality. Check out the simpleTexture example in the SDK. It will show you how to allocate image and filter in the special array format, bind them to texture references, and then access them from your device code.
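For reference, a minimal sketch of that setup using the texture reference API (texImage, d_imageArray, h_image, width and height are placeholder names, not from your code):

// 2D texture reference for the image, declared at file scope
texture<int, 2, cudaReadModeElementType> texImage;

// allocate a cudaArray and copy the image into it
cudaChannelFormatDesc desc = cudaCreateChannelDesc<int>();
cudaArray* d_imageArray;
cudaMallocArray(&d_imageArray, &desc, width, height);
cudaMemcpyToArray(d_imageArray, 0, 0, h_image, width * height * sizeof(int), cudaMemcpyHostToDevice);

// bind the array to the texture reference
cudaBindTextureToArray(texImage, d_imageArray, desc);

In the kernel, the load image[imageY * imageDim.x + imageX] then becomes tex2D(texImage, imageX, imageY), which goes through the texture cache instead of issuing an uncoalesced global load.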

Hi Seibert.

Thanks for your reply. I have tried two optimizations now. The first is to load the filter into constant ("symbol") memory, and the second is to use textures for the filter and image, as you suggested. Here are the results:

Image dimensions: (320, 256)

Filter dimensions: (15, 15)

Convolution area: (289, 225), [15;303][15;239]

Number of pixels to calculate: 65025

Number of integer operations: 29261250

Grid dimensions: 17 * 15 blocks

Block dimensions: 17 * 15 threads

CPU execution time: 258.864685 ms

GPU execution time: 9.637045 ms

GPU execution time: 9.647624 ms

GPU execution time: 9.480155 ms

GPU and CPU results match.

GPU texture execution time: 0.051067 ms

GPU texture execution time: 0.042767 ms

GPU texture execution time: 0.042779 ms

GPU and CPU results match.

But now I'm amazed by the 5000x speed-up. I'm using QueryPerformanceCounter on Windows to time the CPU (Intel Core 2 Duo) and cutCreateTimer to time the GPU. Are these numbers really realistic?

Thanks in advance,

Andreas

Make sure you have a CUDA_SAFE_CALL(cudaThreadSynchronize()); after your kernel call, before you stop the timer. Kernel launches are asynchronous, so without the synchronization you only measure the launch overhead rather than the kernel itself, which would explain the suspiciously low texture timings.
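For example (a sketch using the cutil timers from the SDK's cutil.h; dimGrid, dimBlock and the kernel arguments are taken from your code above):

unsigned int timer = 0;
cutCreateTimer(&timer);

cutStartTimer(timer);
convolve<<<dimGrid, dimBlock>>>(gpuImage, gpuFilter, gpuResult, imageDim, filterDim);
// the launch returns immediately; block until the kernel has actually finished
CUDA_SAFE_CALL(cudaThreadSynchronize());
cutStopTimer(timer);

printf("GPU execution time: %f ms\n", cutGetTimerValue(timer));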

Ah, good idea about the symbol memory for filter. Symbol (or “constant”) memory is best for broadcast reads, which you have since all threads in a warp read the same filter element at the same time. You might find that using constant memory for the filter and texture memory for the image is the overall best option.
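A sketch of the constant-memory side, assuming a 15x15 int filter as in your output (c_filter, MAX_FILTER_SIZE and h_filter are placeholder names):

// filter coefficients in constant memory, declared at file scope
#define MAX_FILTER_SIZE (15 * 15)
__constant__ int c_filter[MAX_FILTER_SIZE];

// host side: copy the coefficients up once before launching the kernel
cudaMemcpyToSymbol(c_filter, h_filter, filterDim.x * filterDim.y * sizeof(int));

In the kernel, filter[filterY * filterDim.x + filterX] becomes c_filter[filterY * filterDim.x + filterX], and since all threads in a warp read the same element in the same iteration, the constant cache serves it as a single broadcast.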