Hi,

I am writing simple bruteforce nearest neighbours search. My kernel is as follows:

```
__global__ void kernelBFNNQParallelObject(float* HitsX, float* HitsY, float* HitsZ,const int* N,const int* Ncumul, int pitch, TrdCuSP *SP)
{
unsigned int pairIdx = 2 * blockIdx.y;
unsigned int queryIdx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tid = threadIdx.x;
float dist, myMin=10000000.f;
__shared__ float2 oddHits[1536];
float2 even,odd;
int minIdx=0;
unsigned int start = pitch * pairIdx;
do {
oddHits[tid].x = HitsX[start + tid];
oddHits[tid].y = HitsY[start + tid];
tid += blockDim.x;;
} while (tid < N[pairIdx]);
__syncthreads();
if (queryIdx < N[pairIdx+1]){
even.x = HitsX[pitch * (pairIdx + 1) + queryIdx];
even.y = HitsY[pitch * (pairIdx + 1) + queryIdx];
// for (int i = 0; i < 1024; i++){
for (int i = 0; i < 512; i++){
odd = oddHits[i];
dist = (even.x-odd.x)*(even.x-odd.x) + (even.y-odd.y)*(even.y-odd.y);
myMin = fminf(myMin, dist);
if (myMin == dist) minIdx = i;
}
for (int i = 0; i < 512; i++){
odd = oddHits[i+512];
dist = (even.x-odd.x)*(even.x-odd.x) + (even.y-odd.y)*(even.y-odd.y)+0.0001f;
myMin = fminf(myMin, dist);
if (myMin == dist) minIdx = i+512;
}
// Here should be writes to global memory
}
}
```

I time this kernel with cuda events. Version with one for from 0 to 1024 takes 132.618ms and version with two fors from 0 to 512 takes 87.768ms. If I didn’t add 0.0001 to dist compiler optimized second loop away, it doesn’t change the result.

My question is: Why differences in timings?

I attach a plot with timings for different upper bound of loop. (there’s a minimum for 16, why?)

GridDim is (3,6000,0) BlockDim is (512,0,0)

Thanks for ideas,

Emil Kaptur