Possible compiler bug

Hi all,

I encountered a very weird bug, and I'm not 100% sure whether it's just a mistake on my side, so please bear with me for a moment.

What I want to do is feature matching on the GPU. In my case I have 1024 "database" vectors (each 64 floats long) and 1024 "query" vectors (also 64 floats long).

What I want to calculate, for each query vector, is the closest DB vector (in the Euclidean norm). In regular C this would look like this:

void featureMatch_cpu(
    float *g_database, // M features of 16*4 = 64 dimensions
    float *g_features, // K features of 16*4 = 64 dimensions
    int   *g_indeces,  // K output indices
    int    M,          // # of DB features
    int    K           // # of query features
){
    for (int j = 0; j < K; j++) {
        int   minidx = -1;
        float minval = 1e10f;
        for (int i = 0; i < M; i++) {
            float dist = 0.0f;
            for (int d = 0; d < 64; d++) {
                float diff = g_database[i*64 + d] - g_features[j*64 + d];
                dist += diff * diff;
            }
            if (dist < minval) {
                minidx = i;
                minval = dist;
            }
        }
        g_indeces[j] = minidx;
    }
}

In CUDA this can of course be done very quickly, but it is a little more complicated. The way we did it is to use 8 threads per feature and blocks of 16 features, so in our case there are 64 blocks of 128 threads each. Each block compares 16 query features against all 1024 DB features. To make things more efficient, the DB features are loaded into shared memory two at a time, so essentially there is a small inner loop that executes twice each time two DB features have been loaded. The catch is that the kernel works if this small loop is manually unrolled, but gives wrong results if it is not. The rolled version looks like this (a sketch of the surrounding kernel code follows after the two snippets):

#pragma unroll 2
for (unsigned int iter = 0; iter < 2; iter++) {
    // calculate feature distance for feature subvector (first even, then odd)
    float val = 0;
    for (unsigned int n = 0; n < 8; n++) {
        float diff = ownfeature[n] - DB_FEATURE(subid + n*8 + iter*64);
        val += diff * diff;
    }
    //__syncthreads();

    // store feature distance in accumulation memory
    ACC_DATA(tid) = val;
    __syncthreads();

    // thread 0 calculates total feature distance
    if (subid == 0) {
        val += ACC_DATA(tid+1) + ACC_DATA(tid+2) + ACC_DATA(tid+3) +
               ACC_DATA(tid+4) + ACC_DATA(tid+5) + ACC_DATA(tid+6) +
               ACC_DATA(tid+7);

        // keep new feature distance if it is the minimum
        index  = ((val < minVal) ? (m + iter) : index);
        minVal = ((val < minVal) ? val : minVal);
    }
    __syncthreads();
}

and the manually unrolled version looks like this:

// calculate feature distance for the even feature subvector
float val = 0;
for (unsigned int n = 0; n < 8; n++) {
    float diff = ownfeature[n] - DB_FEATURE(subid + n*8);
    val += diff * diff;
}

// store feature distance in accumulation memory
ACC_DATA(tid) = val;
__syncthreads();

// thread 0 calculates total feature distance
if (subid == 0) {
    val += ACC_DATA(tid+1) + ACC_DATA(tid+2) + ACC_DATA(tid+3) +
           ACC_DATA(tid+4) + ACC_DATA(tid+5) + ACC_DATA(tid+6) +
           ACC_DATA(tid+7);

    // keep new feature distance if it is the minimum
    index  = ((val < minVal) ? m : index);
    minVal = ((val < minVal) ? val : minVal);
}
__syncthreads();

// calculate feature distance for the odd feature subvector
val = 0;
for (unsigned int n = 0; n < 8; n++) {
    float diff = ownfeature[n] - DB_FEATURE(subid + n*8 + 64);
    val += diff * diff;
}
__syncthreads();

// store feature distance in accumulation memory
ACC_DATA(tid) = val;
__syncthreads();

// thread 0 calculates total feature distance
if (subid == 0) {
    val += ACC_DATA(tid+1) + ACC_DATA(tid+2) + ACC_DATA(tid+3) +
           ACC_DATA(tid+4) + ACC_DATA(tid+5) + ACC_DATA(tid+6) +
           ACC_DATA(tid+7);

    // keep new feature distance if it is the minimum
    index  = ((val < minVal) ? (m + 1) : index);
    minVal = ((val < minVal) ? val : minVal);
}
__syncthreads();
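For readers who do not want to open bug.cu: the two snippets sit inside a loop over the DB features (index m, advancing by 2 each iteration) and rely on macros and variables defined elsewhere. What follows is only a simplified sketch of that scaffolding, reconstructed from the description above rather than copied from the actual code, so the kernel name, the shared-memory layout, and the load code are illustrative assumptions:

// Hypothetical reconstruction of the kernel scaffolding, NOT the bug.cu code.
#define FEAT_DIM     64                      // floats per feature
#define SUBDIM        8                      // dimensions handled per thread
#define FEAT_PER_BLK 16                      // query features per block
#define NTHREADS     (FEAT_PER_BLK * SUBDIM) // 128 threads per block

__global__ void featureMatch_gpu(const float *g_database,
                                 const float *g_features,
                                 int *g_indeces, int M)
{
    __shared__ float s_db[2 * FEAT_DIM]; // two DB features at a time
    __shared__ float s_acc[NTHREADS];    // per-thread partial distances
#define DB_FEATURE(i) s_db[i]
#define ACC_DATA(i)   s_acc[i]

    unsigned int tid   = threadIdx.x;
    unsigned int subid = tid % SUBDIM; // which 8-dimensional slice this thread owns
    unsigned int feat  = blockIdx.x * FEAT_PER_BLK + tid / SUBDIM; // query feature id

    // cache this thread's 8 dimensions of its query feature in registers
    float ownfeature[SUBDIM];
    for (unsigned int n = 0; n < SUBDIM; n++)
        ownfeature[n] = g_features[feat * FEAT_DIM + subid + n * SUBDIM];

    int   index  = -1;
    float minVal = 1e10f;

    for (int m = 0; m < M; m += 2) {
        // cooperatively load the next two DB features into shared memory
        // (both snippets end with __syncthreads(), so writing here is safe)
        s_db[tid] = g_database[m * FEAT_DIM + tid];
        __syncthreads();

        // ... the rolled or the manually unrolled snippet from above goes
        // here, computing the distances to DB features m and m+1 and letting
        // the subid == 0 thread of each group update index / minVal ...
    }

    // one thread per query feature writes out the index of the best match
    if (subid == 0)
        g_indeces[feat] = index;
}

With K = 1024 query features this would be launched as featureMatch_gpu<<<K/16, 128>>>(g_database, g_features, g_indeces, M).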

When running the program, the following happens:

If the code runs on the GPU (tested with a GTX 280, an 8800 Ultra, and an 8600 GT, using driver 177.13, CUDA 1.1, and CUDA SDK 1.1 on 32-bit Linux with g++ 4.1), the unrolled kernel gives the correct result while the rolled kernel gives a wrong result.

If run in emulation mode, both kernels give the correct result.

(This was verified by comparing the results against the CPU version of the code.)
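The comparison itself is nothing fancy; a minimal sketch of such a check (the names here are placeholders, not the ones used in bug.cu):

#include <stdio.h>

// minimal verification sketch; array names are placeholders, not bug.cu's
int countMismatches(const int *cpu_indeces, const int *gpu_indeces, int K)
{
    int errors = 0;
    for (int j = 0; j < K; j++)
        if (gpu_indeces[j] != cpu_indeces[j])
            errors++;
    printf("%d of %d query features got a different DB index\n", errors, K);
    return errors;
}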

The entire code can be found here:

Makefile

bug.cu

You need to edit the first two lines of the Makefile to point to your CUDA directory and CUDA SDK.

Can anyone tell me if they spot a mistake in my code? Or is this possibly a real compiler bug? All comments are greatly appreciated, and I would be glad if someone could test this on their machines as well.

Without looking at your code…

Why don’t you use CUBLAS and perform a matrix-matrix multiplication? The resulting 1024x1024 matrix would only take 4 MB on the GPU… no big deal.
Paired with a max search on the rows, you would have your best match.
And you wouldn’t even have to develop your own kernel.

Well, we use squared differences and not correlation, so I don't see how to apply matrix-matrix multiplication directly (though I guess there might be some smart formulation to change that; see the sketch below). But more importantly, in the real-life scenario we actually have up to 1 million DB features (so GPU memory is essentially full), and there is no way in hell to store the resulting matrix and then take the max/min.
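For reference, the kind of formulation I am guessing at would expand the squared Euclidean distance into norms and a dot product:

\[
\|q_j - x_i\|^2 = \|q_j\|^2 + \|x_i\|^2 - 2\, q_j^\top x_i
\]

The cross terms $q_j^\top x_i$ for all pairs are then exactly the entries of a single matrix product $QX^\top$ (one GEMM), and since $\|q_j\|^2$ is constant within row $j$, the row-wise minimum only needs $\|x_i\|^2 - 2\,q_j^\top x_i$. That still would not fix the memory problem for a million DB features, though.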

THE POINT I AM REALLY TRYING TO MAKE HERE IS THAT ESSENTIALLY EQUIVALENT CODE (just a manual unroll) GIVES DIFFERENT RESULTS!