Need advice to speed-up my code

garciav · February 15, 2008, 9:04am

Hi,

I’d like to optimize my code. Here is a description of my problem:

Let “ref” be a matrix of size “ref_nb * dim” (rows*cols).

Let “query” be a matrix of size “query_nb * dim”.

I’d like to compute the Euclidean distance between each column of ref and each column of query.

Thus, I define a matrix “dist” of size “ref_nb * query_nb”, where the value in column X and row Y is the distance between the Xth column of ref and the Yth column of query. To speed up my code, I compute the square of the Euclidean distance instead.

I paste the code below :

// CUDA function
// BLOCK_DIM is a compile-time constant defined elsewhere (value not shown here);
// the indexing assumes the kernel is launched with BLOCK_DIM x BLOCK_DIM thread blocks.
__global__ void computeDistance(float* ref, int ref_nb, float* query, int query_nb, int dim, float* dist)
{
    // Declarations
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;  // index into ref
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;  // index into query
    float ssd = 0;  // Sum of squared differences
    float val;      // Temp value

    // Compute all the distances
    if (xIndex < ref_nb && yIndex < query_nb) {
        for (int j = 0; j < dim; j++) {
            val = ref[j * ref_nb + xIndex] - query[j * query_nb + yIndex];
            ssd += val * val;
        }
        dist[yIndex * ref_nb + xIndex] = ssd;
    }
}

I think that the reads from “ref” and the writes to “dist” are coalesced, but the reads from “query” are not. I can’t use shared memory to work around this because the dimensions are not constant.

Do you think my code is good as it is, or do you have an idea for speeding it up?

Thanks for your help :)

Vince

S.Warris · February 15, 2008, 1:28pm

This problem would really benefit from the reduction algorithm (see the SDK).

MisterAnderson42 · February 17, 2008, 3:43pm

What is dim? If it’s small (i.e., less than 10 or 20), the reduction algorithm won’t help: you are going to be limited by memory performance.

The read from query is not coalesced because it is indexed with threadIdx.y. A 2D texture would be an option if you can’t transpose query to get coalesced reads. One way to verify whether you are getting coalesced reads is to use the profiler.
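One further possibility, not proposed in the thread itself, is to stage both matrices through shared memory in fixed-size tiles, which still works when ref_nb, query_nb and dim are only known at run time. What follows is a minimal sketch under stated assumptions: BLOCK_DIM is taken to be 16, the names computeDistanceTiled and countMismatches are hypothetical, the indexing is that of the kernel posted above, and CUDA error checking is omitted for brevity.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_DIM 16  // assumed tile width; the thread does not state its value

// Hypothetical tiled variant of the posted kernel. Same indexing as the original:
// component j of reference point x is ref[j * ref_nb + x], likewise for query.
// Must be launched with BLOCK_DIM x BLOCK_DIM thread blocks.
__global__ void computeDistanceTiled(const float* ref, int ref_nb,
                                     const float* query, int query_nb,
                                     int dim, float* dist)
{
    __shared__ float refTile[BLOCK_DIM][BLOCK_DIM];
    __shared__ float queryTile[BLOCK_DIM][BLOCK_DIM];

    int tx = threadIdx.x, ty = threadIdx.y;
    int refCol   = blockIdx.x * BLOCK_DIM + tx;  // ref point loaded and computed by this thread
    int queryCol = blockIdx.y * BLOCK_DIM + tx;  // query point loaded by this thread
    int yIndex   = blockIdx.y * BLOCK_DIM + ty;  // query point this thread computes for
    float ssd = 0.0f;

    // Walk over the dimensions in chunks of BLOCK_DIM.
    for (int j0 = 0; j0 < dim; j0 += BLOCK_DIM) {
        int j = j0 + ty;  // dimension loaded by this thread
        // Both loads use threadIdx.x as the fastest-varying index.
        // Out-of-range entries are padded with 0 so they add nothing to the sum.
        refTile[ty][tx]   = (j < dim && refCol   < ref_nb)   ? ref[j * ref_nb + refCol]       : 0.0f;
        queryTile[ty][tx] = (j < dim && queryCol < query_nb) ? query[j * query_nb + queryCol] : 0.0f;
        __syncthreads();

        for (int k = 0; k < BLOCK_DIM; ++k) {
            float val = refTile[k][tx] - queryTile[k][ty];
            ssd += val * val;
        }
        __syncthreads();  // finish reading the tiles before they are overwritten
    }

    if (refCol < ref_nb && yIndex < query_nb)
        dist[yIndex * ref_nb + refCol] = ssd;
}

// Hypothetical host helper: runs the kernel on small synthetic data and
// returns how many entries differ from a plain CPU computation.
int countMismatches(int ref_nb, int query_nb, int dim)
{
    std::vector<float> ref((size_t)dim * ref_nb), query((size_t)dim * query_nb);
    for (size_t i = 0; i < ref.size(); ++i)   ref[i]   = (float)(i % 13) * 0.5f;
    for (size_t i = 0; i < query.size(); ++i) query[i] = (float)(i % 7) * 0.25f;
    std::vector<float> dist((size_t)query_nb * ref_nb, -1.0f);

    float *dRef = 0, *dQuery = 0, *dDist = 0;
    cudaMalloc(&dRef,   ref.size()   * sizeof(float));
    cudaMalloc(&dQuery, query.size() * sizeof(float));
    cudaMalloc(&dDist,  dist.size()  * sizeof(float));
    cudaMemcpy(dRef,   ref.data(),   ref.size()   * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dQuery, query.data(), query.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(BLOCK_DIM, BLOCK_DIM);
    dim3 grid((ref_nb + BLOCK_DIM - 1) / BLOCK_DIM, (query_nb + BLOCK_DIM - 1) / BLOCK_DIM);
    computeDistanceTiled<<<grid, block>>>(dRef, ref_nb, dQuery, query_nb, dim, dDist);
    cudaMemcpy(dist.data(), dDist, dist.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dRef); cudaFree(dQuery); cudaFree(dDist);

    int mismatches = 0;
    for (int y = 0; y < query_nb; ++y) {
        for (int x = 0; x < ref_nb; ++x) {
            float ssd = 0.0f;
            for (int j = 0; j < dim; ++j) {
                float val = ref[(size_t)j * ref_nb + x] - query[(size_t)j * query_nb + y];
                ssd += val * val;
            }
            if (fabsf(ssd - dist[(size_t)y * ref_nb + x]) > 1e-3f) ++mismatches;
        }
    }
    return mismatches;
}
```

Calling countMismatches(100, 37, 21) from main exercises sizes that are not multiples of BLOCK_DIM, which is where the zero padding matters. Whether this beats the texture or transpose options would have to be measured on the actual hardware.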

Another way to analyze the performance is to count the number of memory reads/writes, time the code, and calculate the transfer rate you get in GiB/s. About 70 GiB/s should be achievable.
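As a rough illustration of that calculation for the kernel posted above, here is a small host-side sketch. It assumes each of the ref_nb * query_nb threads reads dim floats from ref, reads dim floats from query, and writes one float to dist. The function names are hypothetical, and the count is of requested bytes, not of what the memory system actually transfers.

```cuda
#include <cstddef>

// Bytes requested by the posted kernel, assuming ref_nb * query_nb active threads,
// each doing 2 * dim float reads and 1 float write.
double bytesMoved(int ref_nb, int query_nb, int dim)
{
    double threads = (double)ref_nb * (double)query_nb;
    return threads * (2.0 * dim + 1.0) * (double)sizeof(float);
}

// Effective bandwidth in GiB/s for a kernel time given in milliseconds
// (the unit reported by cudaEventElapsedTime).
double effectiveGiBPerSec(int ref_nb, int query_nb, int dim, float elapsedMs)
{
    double seconds = elapsedMs / 1000.0;
    return bytesMoved(ref_nb, query_nb, dim) / (1024.0 * 1024.0 * 1024.0) / seconds;
}
```

The elapsed time can be taken with cudaEventRecord before and after the kernel launch, followed by cudaEventSynchronize and cudaEventElapsedTime. As a made-up example, 1024 reference points, 1024 query points and dim = 8 taking 1 ms would work out to about 66.4 GiB/s.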