# Need advice to speed up my code

Hi,

I’d like to optimize my code, so I describe my problem below:

Let “ref” be a matrix of size “ref_nb * dim” (rows * cols).

Let “query” be a matrix of size “query_nb * dim”.

I’d like to compute the Euclidean distance between each column of ref and each column of query.

Thus, I define a matrix “dist” of size “ref_nb * query_nb” where the value at row X and column Y corresponds to the distance between the Xth column of ref and the Yth column of query. To speed up my code, I compute the square of the Euclidean distance instead of the distance itself.

I paste the code below:

```
// CUDA function (BLOCK_DIM is the thread-block side length, defined elsewhere)
__global__ void computeDistance(float* ref, int ref_nb, float* query, int query_nb, int dim, float* dist)
{
    // Declaration
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    float ssd = 0;    // Sum of squared differences
    float val;        // Temp value

    // Compute all the distances
    if (xIndex < ref_nb && yIndex < query_nb) {
        for (int j = 0; j < dim; j++) {
            val = ref[j*ref_nb + xIndex] - query[j*query_nb + yIndex];
            ssd += val*val;
        }
        dist[yIndex*ref_nb + xIndex] = ssd;
    }
}
```

I think that the reads from “ref” and the writes to “dist” are coalesced, but the reads from “query” are not. I can’t use shared memory to manage this problem because the dimensions are not compile-time constants.
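One idea I have considered but am not sure about: even though dim is not a constant, the loop over dim could be split into fixed-size chunks of BLOCK_DIM, so that each chunk of ref and query can be staged in shared memory with coalesced loads. A rough, untested sketch (assuming square BLOCK_DIM × BLOCK_DIM thread blocks; the kernel name `computeDistanceTiled` is just for illustration):

```
#define BLOCK_DIM 16

__global__ void computeDistanceTiled(float* ref, int ref_nb,
                                     float* query, int query_nb,
                                     int dim, float* dist)
{
    __shared__ float shRef[BLOCK_DIM][BLOCK_DIM];
    __shared__ float shQuery[BLOCK_DIM][BLOCK_DIM];

    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;  // ref column
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;  // query column

    float ssd = 0.0f;

    // Walk over dim in fixed-size chunks of BLOCK_DIM rows.
    for (int start = 0; start < dim; start += BLOCK_DIM) {
        int row = start + threadIdx.y;  // coordinate row loaded by this thread

        // Coalesced loads: consecutive threadIdx.x -> consecutive addresses.
        shRef[threadIdx.y][threadIdx.x] =
            (row < dim && xIndex < ref_nb)
                ? ref[row * ref_nb + xIndex] : 0.0f;

        int qCol = blockIdx.y * BLOCK_DIM + threadIdx.x;  // query column for the load
        shQuery[threadIdx.y][threadIdx.x] =
            (row < dim && qCol < query_nb)
                ? query[row * query_nb + qCol] : 0.0f;

        __syncthreads();

        if (xIndex < ref_nb && yIndex < query_nb) {
            int rows = min(BLOCK_DIM, dim - start);
            for (int j = 0; j < rows; j++) {
                float val = shRef[j][threadIdx.x] - shQuery[j][threadIdx.y];
                ssd += val * val;
            }
        }

        __syncthreads();
    }

    if (xIndex < ref_nb && yIndex < query_nb)
        dist[yIndex * ref_nb + xIndex] = ssd;
}
```

Would something like this be the right direction, or is there a better way?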

Do you think that my code is good, or do you have an idea to speed it up?