Another launch timeout...

Hello,

My kernel fails with the error “the launch timed out and was terminated”, depending on the size of the input data.

Kernel code

__global__ void foo(float *src, float *dst, int pitch, int width, int height, float constx2, float consty2)
{
  // get the coordinates of the current pixel from the GPU runtime variables
  unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

  // check whether this pixel lies within the image dimensions
  if (x < width && y < height) {

    // init the running sums
    float fw = 0;
    float fwx = 0;
    float fwy = 0;
    float w = 0;
    float wx = 0;
    float wy = 0;
    float wx2 = 0;
    float wy2 = 0;
    float wxy = 0;

    // sum over the whole image: every thread visits every source pixel once
    for (int i = 0; i < width; ++i) {
      for (int j = 0; j < height; ++j) {
        int index = j * pitch + i;        // pitch is in elements, not bytes
        float elem = src[index];
        int xdat = i - (int)x;            // offset of (i, j) from this pixel
        int xdat2 = xdat * xdat;
        int ydat = j - (int)y;
        int ydat2 = ydat * ydat;
        // Gaussian weight, falling off with the squared distance
        float weight = expf(-(xdat2 / constx2 + ydat2 / consty2));

        // update the running sums
        float weight_elem = weight * elem;
        fw  += weight_elem;
        fwx += weight_elem * xdat;
        fwy += weight_elem * ydat;
        w   += weight;
        wx  += weight * xdat;
        wy  += weight * ydat;
        wx2 += weight * xdat2;
        wy2 += weight * ydat2;
        wxy += weight * xdat * ydat;
      }
    }

    // determinant of the 3x3 matrix of weighted sums
    float det = w*wx2*wy2 + wx*wxy*wy + wy*wx*wxy -
                wy*wx2*wy - wxy*wxy*w - wy2*wx*wx;

    // solve for the first component via Cramer's rule and store it
    unsigned int center = y * pitch + x;
    dst[center] = ((wx2*wy2 - wxy*wxy) * fw - (wx*wy2 - wxy*wy) * fwx + (wx*wxy - wx2*wy) * fwy) / det;
  }
}

I call this kernel with square data sets of size 16x16, 32x32, …, 400x400. The results of the successfully computed data sets are numerically correct (checked against a CPU implementation). Note that every thread loops over the entire image, so the work per launch grows with the fourth power of the edge length; the launch error usually happens at 384x384, but sometimes earlier.
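For completeness, the host side launches the kernel essentially like this (a stripped-down sketch, not my literal code; d_src, d_dst and size stand in for my actual variables, and pitch is in elements):

#include <cstdio>

void runFoo(float *d_src, float *d_dst, int pitch, int size,
            float constx2, float consty2)
{
  dim3 block(8, 8);                             // blocksize reported by nvcc
  dim3 grid((size + block.x - 1) / block.x,     // round up so the grid
            (size + block.y - 1) / block.y);    // covers the whole image
  foo<<<grid, block>>>(d_src, d_dst, pitch, size, size, constx2, consty2);
  cudaError_t err = cudaThreadSynchronize();    // the timeout is reported here
  if (err != cudaSuccess)
    printf("CUDA Error: %s\n", cudaGetErrorString(err));
}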

Program Output

Computing 16 x 16 dataset: Overall convolution procedure on GPU took 0.47298 ms.

Computing 32 x 32 dataset: Overall convolution procedure on GPU took 1.61663 ms.

Computing 48 x 48 dataset: Overall convolution procedure on GPU took 4.03552 ms.

Computing 64 x 64 dataset: Overall convolution procedure on GPU took 11.3857 ms.

Computing 80 x 80 dataset: Overall convolution procedure on GPU took 29.6199 ms.

Computing 96 x 96 dataset: Overall convolution procedure on GPU took 56.4099 ms.

...

Computing 352 x 352 dataset: Overall convolution procedure on GPU took 10262 ms.

Computing 368 x 368 dataset: Overall convolution procedure on GPU took 12278.6 ms.

Computing 384 x 384 dataset: CUDA Error in cudaSeewigConvolutionKernel(), line 91: the launch timed out and was terminated

Additional info

System:

Windows XP SP3, driver version 178.08, 2x 9800GX2

NVCC output:

Local: 0

Shared: 44

Registers: 24

Blocksize: 8x8

Has anyone got an idea what is happening here?

Thanks in advance

Seems like you're hitting the watchdog bug that wasn't completely fixed in 178.08. On Windows the display driver is supposed to kill any kernel that runs longer than about five seconds, and your launch times are already well past that.

Check this thread: http://forums.nvidia.com/index.php?showforum=70
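Until a fixed driver is available, a common workaround is to split the work into several shorter launches, so that each one returns control to the driver before the watchdog fires. Roughly like this (untested sketch; fooStrip is a hypothetical variant of your kernel that takes an extra row offset and computes y as blockIdx.y * blockDim.y + threadIdx.y + yOffset):

// Untested sketch: process the image in horizontal strips so that no
// single launch runs long enough to trip the Windows watchdog.
int stripHeight = 32;   // tune so one strip finishes in well under a second
dim3 block(8, 8);
for (int yOffset = 0; yOffset < size; yOffset += stripHeight) {
  int rows = (size - yOffset < stripHeight) ? size - yOffset : stripHeight;
  dim3 grid((size + block.x - 1) / block.x,
            (rows + block.y - 1) / block.y);
  fooStrip<<<grid, block>>>(d_src, d_dst, pitch, size, size,
                            constx2, consty2, yOffset);
  cudaThreadSynchronize();  // wait for this strip before launching the next
}

The existing y < height check in your kernel already takes care of the last, possibly partial, strip.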

Looks like this applies here - it does seem to work correctly in a Linux environment.

Thanks a lot