Hello,
my kernel fails with the error "the launch timed out and was terminated", depending on the size of the input data.
Kernel code
__global__ void foo(float *src, float *dst, int pitch, int width, int height, float constx2, float consty2)
{
    // get coordinates of the current pixel from the GPU runtime variables
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    // check whether this pixel lies within the image dimensions
    if (x < width && y < height) {
        // init accumulators for the weighted sums
        float fw  = 0;
        float fwx = 0;
        float fwy = 0;
        float w   = 0;
        float wx  = 0;
        float wy  = 0;
        float wx2 = 0;
        float wy2 = 0;
        float wxy = 0;
        // sum over the whole image
        for (int i = 0; i < width; ++i) {
            for (int j = 0; j < height; ++j) {
                int index = j * pitch + i;
                float elem = src[index];
                int xdat  = i - x;
                int xdat2 = xdat * xdat;
                int ydat  = j - y;
                int ydat2 = ydat * ydat;
                // Gaussian weight; expf keeps the computation in single precision
                float weight = expf(-(xdat2 / constx2 + ydat2 / consty2));
                // update the running sums
                float weight_elem = weight * elem;
                fw  += weight_elem;
                fwx += weight_elem * xdat;
                fwy += weight_elem * ydat;
                w   += weight;
                wx  += weight * xdat;
                wy  += weight * ydat;
                wx2 += weight * xdat2;
                wy2 += weight * ydat2;
                wxy += weight * xdat * ydat;
            }
        }
        // determinant of the 3x3 normal-equation matrix (rule of Sarrus)
        float det = w*wx2*wy2 + wx*wxy*wy + wy*wx*wxy -
                    wy*wx2*wy - wxy*wxy*w - wy2*wx*wx;
        unsigned int center = y * pitch + x;
        dst[center] = ((wx2*wy2 - wxy*wxy) * fw - (wx*wy2 - wxy*wy) * fwx + (wx*wxy - wx2*wy) * fwy) / det;
    }
}
I call this kernel with square data sets of size 16x16, 32x32, …, 400x400. The results of the successfully computed data sets are numerically correct (checked against a CPU implementation). The launch error usually occurs at 384x384, but sometimes earlier.
Program Output
Computing 16 x 16 dataset: Overall convolution procedure on GPU took 0.47298 ms.
Computing 32 x 32 dataset: Overall convolution procedure on GPU took 1.61663 ms.
Computing 48 x 48 dataset: Overall convolution procedure on GPU took 4.03552 ms.
Computing 64 x 64 dataset: Overall convolution procedure on GPU took 11.3857 ms.
Computing 80 x 80 dataset: Overall convolution procedure on GPU took 29.6199 ms.
Computing 96 x 96 dataset: Overall convolution procedure on GPU took 56.4099 ms.
...
Computing 352 x 352 dataset: Overall convolution procedure on GPU took 10262 ms.
Computing 368 x 368 dataset: Overall convolution procedure on GPU took 12278.6 ms.
Computing 384 x 384 dataset: CUDA Error in cudaSeewigConvolutionKernel(), line 91: the launch timed out and was terminated
Additional info
System:
Windows XP SP3, Driver version 178_08, 2x 9800GX2
NVCC output:
Local: 0
Shared: 44
Registers: 24
Blocksize: 8x8
Does anyone have an idea what is happening here?
Thanks in advance