Weird kernel behavior: a bug with kernel divergence?

Dear readers,

I am running into a very strange behaviour:

I am writing a template matcher. For each pixel in a reference image, it searches for the most similar pixel in another image. Once it has gone through the whole search space, it stores the coordinates of the best match at the reference pixel's address.
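To make the intent concrete, here is a minimal host-side sketch in plain C++ of what each thread is meant to compute. It is only an illustration under assumptions: a single-channel, row-major image with clamp-to-edge reads, and the names Image, at and bestMatch are made up for this example and are not part of my actual code.

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical single-channel image, stored row-major (illustration only).
struct Image {
    int w, h;
    std::vector<float> px;

    // Clamp-to-edge read, standing in for a clamped texture fetch.
    float at(int x, int y) const {
        x = std::max(0, std::min(x, w - 1));
        y = std::max(0, std::min(y, h - 1));
        return px[y * w + x];
    }
};

// For the reference pixel (x, y), return the coordinates in `other` whose
// surrounding patch has the smallest sum of squared differences (SSD).
std::pair<int, int> bestMatch(const Image& ref, const Image& other,
                              int x, int y, int trackR, int searchR) {
    float bestSsd = std::numeric_limits<float>::max();
    std::pair<int, int> best(x, y);

    for (int sy = -searchR; sy <= searchR; ++sy) {
        for (int sx = -searchR; sx <= searchR; ++sx) {
            // Accumulate the SSD over the (2*trackR+1)^2 patch.
            float ssd = 0.0f;
            for (int ty = -trackR; ty <= trackR; ++ty) {
                for (int tx = -trackR; tx <= trackR; ++tx) {
                    float d = ref.at(x + tx, y + ty)
                            - other.at(x + tx + sx, y + ty + sy);
                    ssd += d * d;
                }
            }
            // Keep the first offset that gives the lowest SSD so far.
            if (ssd < bestSsd) {
                bestSsd = ssd;
                best = std::make_pair(x + sx, y + sy);
            }
        }
    }
    return best;
}
```

A reference like this could also be used to check the kernel's output on a small test image.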

This is a divergent kernel because, for each thread, the best match will be found at a different point inside the for loops.

BUT this kernel never seems to execute. It is as if it returns as soon as it is called, without ever writing anything to the output buffer.

However, if I change the last write

dst[imageW * iy + ix]=best_match;

to

dst[imageW * iy + ix]=make_float4(0.3, 0.3, 0.3, 0.3);

everything works fine: the frame rate drops significantly (which suggests that it is entering all the for loops) and values are written to the output buffer.

I.e., if what I write to dst DOESN'T come from code with divergent behaviour, everything works fine. If the value does originate from divergent code, the whole kernel goes into some sort of “default mode”.

I haven’t been able to overcome this problem; no matter how I reshuffle things, I end up with divergent behaviour and the kernel goes into “default mode”.

Is this a bug?

Any suggestions to get this working?

Thanks!

Dimitri

__global__ void Track(float4* dst,
                      int imageW,
                      int imageH){
  const int ix = blockDim.x * blockIdx.x + threadIdx.x;
  const int iy = blockDim.y * blockIdx.y + threadIdx.y;

  //Add half of a texel to always address exact texel centers
  const float x = (float)ix + 0.5f;
  const float y = (float)iy + 0.5f;

  int track_r=2;
  float tx,ty;
  int search_r=10;
  float sx,sy;
  float bestx, besty;
  float best_ssd=100000000.0;
  float4 ref, search, d;
  float4 best_match= make_float4(x, y, 0, 0);

  if(ix < imageW && iy < imageH){
    for(sx=-search_r;sx<=search_r; sx++){
      for(sx=-search_r;sx<=search_r; sx++){
        float ssd=0;

        for(ty=-track_r;ty<=track_r;ty++){
          for(tx=-track_r;tx<=track_r;tx++){
            ref= tex2D(ftexsrc1, x+tx ,y+ty);
            search= tex2D(ftexsrc2, x+tx+sx, y+ty+sy);

            d.x=ref.x-search.x;
            d.y=ref.y-search.y;
            d.z=ref.z-search.z;

            ssd+=(d.x*d.x +d.y*d.y +d.z*d.z);
          }
        }

        bestx= bestx;
        besty= besty;

        if(ssd<best_ssd){
          best_ssd = ssd;
          best_match= make_float4(x+sx, y+sy, 0, 0);
        }
      }
    }

    dst[imageW * iy + ix]=best_match;
  }
}
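
One thing worth checking in the kernel above: both of the outer search loops use sx as their counter, so sy is never assigned before it is used in the tex2D call and in make_float4. The sketch below, in plain host-side C++ with made-up function names, only counts iterations to show what sharing a counter does to the loop structure. I am not certain this alone explains the symptom, but it does mean the search covers far fewer offsets than intended and that best_match is built from an uninitialized value.

```cpp
// Count how often the innermost body runs when both loops share one counter,
// mirroring the loop headers in the kernel (illustration only).
int countSharedCounter(int searchR) {
    int n = 0;
    float sx;
    for (sx = -searchR; sx <= searchR; sx++) {
        for (sx = -searchR; sx <= searchR; sx++) {  // same counter as the outer loop
            ++n;
        }
        // Here sx == searchR + 1, so the outer loop exits after one pass.
    }
    return n;
}

// Same loops with separate counters for the two search directions.
int countSeparateCounters(int searchR) {
    int n = 0;
    for (float sy = -searchR; sy <= searchR; sy++) {
        for (float sx = -searchR; sx <= searchR; sx++) {
            ++n;
        }
    }
    return n;
}
```

With search_r = 10, the shared-counter version visits 21 offsets, while the separate-counter version visits the full 21 x 21 = 441.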