Hi,

I’d like to optimize my code, so I describe my problem below:

Let `ref` be a matrix of size `ref_nb * dim` (rows * cols), and let `query` be a matrix of size `query_nb * dim`. Both are stored column-major (element (i, j) of `ref` lives at `ref[j * ref_nb + i]`).

I’d like to compute the Euclidean distance between each row (point) of `ref` and each row of `query`. So I define a matrix `dist` of size `ref_nb * query_nb` where the value at row X, column Y is the distance between the Xth row of `ref` and the Yth row of `query`. To speed up my code, I actually compute the squared Euclidean distance.

I paste the CUDA code below:

```
// CUDA kernel
#define BLOCK_DIM 16  // thread block is BLOCK_DIM x BLOCK_DIM

__global__ void computeDistance(float* ref, int ref_nb, float* query, int query_nb, int dim, float* dist)
{
    // Declarations
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;  // index into ref
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;  // index into query
    float ssd = 0;  // sum of squared differences
    float val;      // temp value

    // Compute all the distances
    if (xIndex < ref_nb && yIndex < query_nb) {
        for (int j = 0; j < dim; j++) {
            val = ref[j * ref_nb + xIndex] - query[j * query_nb + yIndex];
            ssd += val * val;
        }
        dist[yIndex * ref_nb + xIndex] = ssd;
    }
}
```

I think the reads of `ref` and the writes to `dist` are coalesced, but the reads of `query` are not. I can’t use shared memory to fix this because the dimensions are not constant (they are only known at run time).
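For what it’s worth, shared memory can still be used even when `dim` is only known at run time: walk the `dim` loop in fixed chunks of `BLOCK_DIM` and handle the remainder with bounds checks, so both `ref` and `query` are loaded with coalesced accesses. A sketch of that idea, untested and with names of my own (`computeDistanceTiled`, a 16x16 block size assumed):

```
#define BLOCK_DIM 16  // thread block is BLOCK_DIM x BLOCK_DIM

// Tiled variant: each block stages a BLOCK_DIM-wide slice of the dim
// dimension for its ref and query points in shared memory. dim need
// not be constant or a multiple of BLOCK_DIM.
__global__ void computeDistanceTiled(const float* ref, int ref_nb,
                                     const float* query, int query_nb,
                                     int dim, float* dist)
{
    __shared__ float refTile[BLOCK_DIM][BLOCK_DIM];    // [dim offset][ref point]
    __shared__ float queryTile[BLOCK_DIM][BLOCK_DIM];  // [dim offset][query point]

    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;  // ref point
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;  // query point
    float ssd = 0.0f;

    for (int j0 = 0; j0 < dim; j0 += BLOCK_DIM) {
        // Thread (tx, ty) loads column j0+ty of point blockIdx.{x,y}*BLOCK_DIM+tx:
        // consecutive tx touches consecutive addresses, so both loads are coalesced.
        int j = j0 + threadIdx.y;
        unsigned int rx = blockIdx.x * BLOCK_DIM + threadIdx.x;
        unsigned int qx = blockIdx.y * BLOCK_DIM + threadIdx.x;
        refTile[threadIdx.y][threadIdx.x] =
            (j < dim && rx < ref_nb) ? ref[j * ref_nb + rx] : 0.0f;
        queryTile[threadIdx.y][threadIdx.x] =
            (j < dim && qx < query_nb) ? query[j * query_nb + qx] : 0.0f;
        __syncthreads();

        // Accumulate over this chunk, clamping k so we never read past dim.
        int kmax = min(BLOCK_DIM, dim - j0);
        for (int k = 0; k < kmax; k++) {
            float val = refTile[k][threadIdx.x] - queryTile[k][threadIdx.y];
            ssd += val * val;
        }
        __syncthreads();
    }

    if (xIndex < ref_nb && yIndex < query_nb)
        dist[yIndex * ref_nb + xIndex] = ssd;
}
```

The only assumption is that a 16x16 tile fits in shared memory, which it easily does; the bounds checks in the loads cover the partial tiles at the edges.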

Do you think my code is good as is, or do you have an idea to speed it up?

Thanks for your help :)

Vince