# Two loops faster then one?

Hi,

I am writing simple bruteforce nearest neighbours search. My kernel is as follows:

``````__global__ void kernelBFNNQParallelObject(float* HitsX, float* HitsY, float* HitsZ,const int* N,const int* Ncumul, int pitch, TrdCuSP *SP)

{

unsigned int pairIdx  = 2 * blockIdx.y;

unsigned int queryIdx = blockIdx.x * blockDim.x + threadIdx.x;

float dist, myMin=10000000.f;

__shared__ float2 oddHits;

float2 even,odd;

int minIdx=0;

unsigned int start = pitch * pairIdx;

do {

oddHits[tid].x = HitsX[start + tid];

oddHits[tid].y = HitsY[start + tid];

tid += blockDim.x;;

} while (tid < N[pairIdx]);

if (queryIdx < N[pairIdx+1]){

even.x = HitsX[pitch * (pairIdx + 1) + queryIdx];

even.y = HitsY[pitch * (pairIdx + 1) + queryIdx];

// for (int i = 0; i < 1024; i++){

for (int i = 0; i < 512; i++){

odd = oddHits[i];

dist = (even.x-odd.x)*(even.x-odd.x) + (even.y-odd.y)*(even.y-odd.y);

myMin = fminf(myMin, dist);

if (myMin == dist) minIdx = i;

}

for (int i = 0; i < 512; i++){

odd = oddHits[i+512];

dist = (even.x-odd.x)*(even.x-odd.x) + (even.y-odd.y)*(even.y-odd.y)+0.0001f;

myMin = fminf(myMin, dist);

if (myMin == dist) minIdx = i+512;

}

// Here should be writes to global memory

}

}
``````

I time this kernel with cuda events. Version with one for from 0 to 1024 takes 132.618ms and version with two fors from 0 to 512 takes 87.768ms. If I didn’t add 0.0001 to dist compiler optimized second loop away, it doesn’t change the result.

My question is: Why differences in timings?

I attach a plot with timings for different upper bound of loop. (there’s a minimum for 16, why?)

GridDim is (3,6000,0) BlockDim is (512,0,0)

Thanks for ideas,

Emil Kaptur