Two loops faster then one?

Hi,

I am writing simple bruteforce nearest neighbours search. My kernel is as follows:

__global__ void kernelBFNNQParallelObject(float* HitsX, float* HitsY, float* HitsZ,const int* N,const int* Ncumul, int pitch, TrdCuSP *SP)

{

  unsigned int pairIdx  = 2 * blockIdx.y;

  unsigned int queryIdx = blockIdx.x * blockDim.x + threadIdx.x;

  unsigned int tid      = threadIdx.x;

  float dist, myMin=10000000.f;

  __shared__ float2 oddHits[1536];

  float2 even,odd;

  int minIdx=0;

  unsigned int start = pitch * pairIdx;

  do {

    oddHits[tid].x = HitsX[start + tid];

    oddHits[tid].y = HitsY[start + tid];

    tid += blockDim.x;;

  } while (tid < N[pairIdx]);

  __syncthreads();

if (queryIdx < N[pairIdx+1]){

    even.x = HitsX[pitch * (pairIdx + 1) + queryIdx];

    even.y = HitsY[pitch * (pairIdx + 1) + queryIdx];

// for (int i = 0; i < 1024; i++){

    for (int i = 0; i < 512; i++){

      odd = oddHits[i];

      dist = (even.x-odd.x)*(even.x-odd.x) + (even.y-odd.y)*(even.y-odd.y);

      myMin = fminf(myMin, dist);

      if (myMin == dist) minIdx = i;

    }

    for (int i = 0; i < 512; i++){

      odd = oddHits[i+512];

      dist = (even.x-odd.x)*(even.x-odd.x) + (even.y-odd.y)*(even.y-odd.y)+0.0001f;

      myMin = fminf(myMin, dist);

      if (myMin == dist) minIdx = i+512;

    }

    // Here should be writes to global memory

}

}

I time this kernel with cuda events. Version with one for from 0 to 1024 takes 132.618ms and version with two fors from 0 to 512 takes 87.768ms. If I didn’t add 0.0001 to dist compiler optimized second loop away, it doesn’t change the result.

My question is: Why differences in timings?

I attach a plot with timings for different upper bound of loop. (there’s a minimum for 16, why?)

GridDim is (3,6000,0) BlockDim is (512,0,0)

Thanks for ideas,

Emil Kaptur