Hi all,

i encoutered a very wired bug, and i’m not 100% shure that it was just a mistake on my side, so please bear with me for a moment.

What i wanted to do is to do feature matching on the gpu. In my case i have 1024 “Database” vectors (each 64 floats long) and 1024 “Query” vectors (also 64 floats long).

What i want to calculate is for each query vector, which is the closest “DB” vector (in the euclidian norm): so in regular c this would look like this:

```
void featureMatch_cpu(
float * g_database, // M features of 16*4 dimensions
float * g_features, // K features of 16*4 dimensions
int * g_indeces, // K output indeces
int M,// #of db features
int K // #of features
){
for (int j=0;j<K;j++){
int minidx=-1;
float minval=1e10;
for (int i=0;i<M;i++)
{
float dist=0.0;
for (int d=0;d<64;d++){
float diff=g_database[i*64+d]-g_features[j*64+d];
dist+=diff*diff;
}
if (dist<minval){
minidx=i;
minval=dist;
}
}
g_indeces[j]=minidx;
}
}
```

In cuda this can be done very quickly of course, but it is a little more complicated. The way we did it was by using 8 threads per feature and blocks of 16 features. so in our case there would be 64 Blocks with each 128 threads. Each block will compare 16 Query features whith the entire 1024 DB features. To make things more efficient, the DB features are loaded 2 at a time into shared memory. So esentially there is a small inner loop which executes twice for every time when 2 DB features are loaded into memory. The catch is now, that the kernel works, if this small loop is manually unrolled but the kernel gives wrong results if this is not done. The rolled version would look like this:

```
#pragma unroll 2
for(unsigned int iter = 0; iter < 2; iter++) {
// calculate feature distance for feature subvector (first even, then odd)
float val = 0;
for(unsigned int n = 0; n < 8; n++) {
float diff = ownfeature[n] - DB_FEATURE(subid + n *8 + iter * 64);
val += diff*diff;
}
//__syncthreads();
// store feature distance in accumulation memory
ACC_DATA(tid) = val;
__syncthreads();
// thread 0 calculates total feature distance
if(subid == 0) {
val +=
ACC_DATA(tid+1) +
ACC_DATA(tid+2) +
ACC_DATA(tid+3) +
ACC_DATA(tid+4) +
ACC_DATA(tid+5) +
ACC_DATA(tid+6) +
ACC_DATA(tid+7);
// keep new feature distance if it is the minimum
index = ( (val < minVal)? (m+iter) : index );
minVal = ( (val < minVal)? val : minVal );
}
__syncthreads();
}
```

where the unrolled version looks like this:

```
//(first even, then odd)
// calculate feature distance for feature subvector (first even, then odd)
float val = 0;
for(unsigned int n = 0; n < 8; n++) {
float diff = ownfeature[n] - DB_FEATURE(subid + n *8 );
val += diff*diff;
}
// store feature distance in accumulation memory
ACC_DATA(tid) = val;
__syncthreads();
// thread 0 calculates total feature distance
if(subid == 0) {
val +=
ACC_DATA(tid+1) +
ACC_DATA(tid+2) +
ACC_DATA(tid+3) +
ACC_DATA(tid+4) +
ACC_DATA(tid+5) +
ACC_DATA(tid+6) +
ACC_DATA(tid+7);
// keep new feature distance if it is the minimum
index = ( (val < minVal)? (m) : index );
minVal = ( (val < minVal)? val : minVal );
}
__syncthreads();
// calculate feature distance for feature subvector (first even, then odd)
val = 0;
for(unsigned int n = 0; n < 8; n++) {
float diff = ownfeature[n] - DB_FEATURE(subid + n *8 + 64);
val += diff*diff;
}
__syncthreads();
// store feature distance in accumulation memory
ACC_DATA(tid) = val;
__syncthreads();
// thread 0 calculates total feature distance
if(subid == 0) {
val +=
ACC_DATA(tid+1) +
ACC_DATA(tid+2) +
ACC_DATA(tid+3) +
ACC_DATA(tid+4) +
ACC_DATA(tid+5) +
ACC_DATA(tid+6) +
ACC_DATA(tid+7);
// keep new feature distance if it is the minimum
index = ( (val < minVal)? (m+1) : index );
minVal = ( (val < minVal)? val : minVal );
}
__syncthreads();
```

When running the programm, then the following happens:

if the code runns on the GPU (tested with GTX280,8800Ultra,8600GT with 177.13 and cuda1.1 cudasdk1.1 on 32bit linux with g++ 4.1), then the unrolled kernel gives the correct result where the rolled kernel give a wrong result.

if run in emulation mode both give the correct result.

(this was done by comparing the results to the cpu version of the code)

the entire code can be found here:

you need to edit the first 2 lines of the Makefile to point to your cuda directory and cuda sdk.

Can anyone tell me if they spotted a mistake in my code? Or is this possible a real compiler bug? All comments are greatly appreciated, and i would be glad is someone could test this on their maschines aswell