Hi people!

I’m writing a CUDA program, and the problem is the following:

- Two matrices A (m × 128) and B (n × 128).

- I take the first row of A and compute the distance between that vector and every row of B, one by one.

- I write each distance into a row of a matrix C, so element C(i, j) contains the distance between row i of A and row j of B.

- Then I proceed with the next row of A.

Other functions then find the minimum distance, but the problem is that the kernel below takes too long.

I’ve implemented it this way:

I’ve got a grid of m × (n / 1024) blocks, with 1024 threads per block.
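With that launch configuration, each thread maps to one (row of A, row of B) pair. The index arithmetic can be sketched on the host side like this (the names `map_thread`, `row_A`, `row_B` are hypothetical stand-ins for the device-side `blockIdx`/`threadIdx` math):

```c
/* Index arithmetic implied by a grid of (m, n/1024) blocks
   with 1024 threads per block:
   - blockIdx.x (bx) selects the row of A,
   - (blockIdx.y, threadIdx.x) = (by, tx) together select the row of B. */
void map_thread(int bx, int by, int tx, int n,
                int *row_A, int *row_B, int *c_index) {
    *row_A   = bx;                   /* one block row per row of A */
    *row_B   = by * 1024 + tx;       /* row of B handled by this thread */
    *c_index = *row_A * n + *row_B;  /* flat index into row-major C */
}
```

Note that with this layout n must be a multiple of 1024 (or the kernel needs a bounds check on the row of B).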

(I’ve put in every kind of guard, don’t worry.)

(The matrices are already in global memory.)

And I launch the following kernel:

```
__global__ void distance_kernel(float* d_C, const float* d_A, const float* d_B,
                                int m, int n) {
    const int NUM_COL = 128;               // A and B have the same number of columns
    __shared__ float common_row[NUM_COL];

    int bx = blockIdx.x;                   // row of A handled by this block
    int tx = threadIdx.x;
    int j  = blockIdx.y * blockDim.x + tx; // row of B handled by this thread

    // all the threads in a block share the same row of A: load it once
    if (tx < NUM_COL) {
        common_row[tx] = d_A[bx * NUM_COL + tx];
    }
    __syncthreads();

    if (bx < m && j < n) {
        // accumulate the squared differences in a register, not in global memory
        float dist = 0.0f;
        for (int i = 0; i < NUM_COL; i++) {
            float diff = common_row[i] - d_B[j * NUM_COL + i];
            dist += diff * diff;
        }
        d_C[bx * n + j] = sqrtf(dist);     // Euclidean distance
    }
}
```

This kernel takes 4 seconds with matrices of size 7000 × 128 and 4000 × 128.

The question is: how can I improve the speed?

distance_kernel.pdf (23.2 KB)