Hello everyone!
I’m writing a program with CUDA and the problem is the following:

I have two matrices, A (n * 128) and B (m * 128).

I take the first row of A, and I compute the distance between that vector and all the rows of B, one by one.

I write each distance into a row of a matrix C, so that element C(i,j) contains the distance between row i of A and row j of B.
Then I proceed with the next row of A.
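To make the expected result concrete, the computation described above can be sketched on the CPU as follows. This is only a reference for checking the GPU output, under some assumptions: the matrices are stored row-major, "distance" means the Euclidean distance (the kernel name suggests so), and the function name euclideanDistancesCpu is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int SIZE = 128;  // number of columns in A and B

// Hypothetical reference helper (not part of the original program).
// A is n x SIZE, B is m x SIZE, both row-major (assumed).
// Returns C (n x m, row-major) with C[i * m + j] = Euclidean distance
// between row i of A and row j of B.
std::vector<float> euclideanDistancesCpu(const std::vector<float>& A,
                                         const std::vector<float>& B,
                                         int n, int m)
{
    std::vector<float> C(static_cast<std::size_t>(n) * m);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < SIZE; ++k) {
                float d = A[i * SIZE + k] - B[j * SIZE + k];
                sum += d * d;  // accumulate squared differences
            }
            C[static_cast<std::size_t>(i) * m + j] = std::sqrt(sum);
        }
    }
    return C;
}
```

Comparing this output with the matrix copied back from the GPU, element by element and with a small tolerance, should show whether the kernel is wrong everywhere or only for some entries.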
I’ve implemented it this way:
I have a grid of ( n * m ) blocks, with 128 threads per block ( 1 * 128 ).
The program compiles, but it doesn’t give the correct distances.
I can’t figure out what’s wrong…
__global__ void EuclidianDistances( float *A, float *B, float *C, int n, int m )
{
    // SIZE is equal to 128
    __shared__ float accumResult[SIZE];
    __shared__ float sA[SIZE];
    __shared__ float sB[SIZE];

    // MAPPING
    int bx = blockIdx.x;  // n
    int by = blockIdx.y;  // m
    int ty = threadIdx.y; // 128
    int tx = threadIdx.x; // 1

    sA[ty] = A[bx * SIZE + ty];
    sB[ty] = B[by * SIZE + ty];
    __syncthreads();

    accumResult[ty] = (sA[ty] - sB[ty]) * (sA[ty] - sB[ty]);
    __syncthreads();

    // Parallel tree reduction
    for (int stride = SIZE/2 ; stride < 0 ; stride >>= 1)
        if (ty < stride)
            accumResult[ty] += accumResult[stride + ty];
    __syncthreads();

    // Writing results to output matrix
    if (threadIdx.y == 0)
        C[bx * m + by] = accumResult[ty];
    __syncthreads();
}
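Looking at the kernel again, three things seem suspicious. The loop condition `stride < 0` is never true, so the reduction never runs. Without braces, the `__syncthreads()` after the loop is not part of the loop body. And no square root is taken, so at best the result would be the squared distance. Below is a possible corrected version, as a sketch only. It assumes a launch with dim3 grid(n, m) and dim3 block(1, SIZE), that SIZE is a power of two, and that the Euclidean distance (with the square root) is what is wanted. The names EuclideanDistancesFixed and euclideanDistancesGpu are hypothetical, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

#define SIZE 128  // columns per row; must be a power of two for this reduction

// One block per (row of A, row of B) pair, one thread per component.
// n is unused here; it is kept to match the original signature.
__global__ void EuclideanDistancesFixed(const float *A, const float *B,
                                        float *C, int n, int m)
{
    __shared__ float accumResult[SIZE];

    int bx = blockIdx.x;   // row index in A
    int by = blockIdx.y;   // row index in B
    int ty = threadIdx.y;  // component index, 0..SIZE-1

    // Each thread computes one squared difference.
    float d = A[bx * SIZE + ty] - B[by * SIZE + ty];
    accumResult[ty] = d * d;
    __syncthreads();

    // Parallel tree reduction: the loop runs while stride > 0.
    for (int stride = SIZE / 2; stride > 0; stride >>= 1) {
        if (ty < stride)
            accumResult[ty] += accumResult[ty + stride];
        __syncthreads();  // inside the loop, outside the if: all threads reach it
    }

    // Thread 0 writes the distance for this (bx, by) pair.
    if (ty == 0)
        C[bx * m + by] = sqrtf(accumResult[0]);
}

// Hypothetical host wrapper: copies inputs, launches the kernel, copies C back.
void euclideanDistancesGpu(const float *hA, const float *hB, float *hC,
                           int n, int m)
{
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, n * SIZE * sizeof(float));
    cudaMalloc((void **)&dB, m * SIZE * sizeof(float));
    cudaMalloc((void **)&dC, n * m * sizeof(float));
    cudaMemcpy(dA, hA, n * SIZE * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, m * SIZE * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(n, m);      // note: gridDim.y is limited to 65535
    dim3 block(1, SIZE);  // matches the ( 1 * 128 ) layout, indexed by threadIdx.y
    EuclideanDistancesFixed<<<grid, block>>>(dA, dB, dC, n, m);

    // This blocking copy waits for the kernel to finish.
    cudaMemcpy(hC, dC, n * m * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

The shared copies sA and sB are dropped in this sketch because each element is read only once, so staging them in shared memory does not seem to buy anything. If the squared distance is what is actually needed, the sqrtf call can simply be removed.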