Euclidean distance between 2 matrices using CUDA

Hello people !

I’m writing a program with CUDA and the problem is the following:

  • Two matrices A (n * 128) and B (m * 128)

  • I take the first row of A, and I compute the distance between that vector and all the rows of B, one by one.

  • I write the result of each distance into a row of a matrix C, so element C(i,j) contains the distance between row i of A and row j of B.

  • Then I proceed with the next row of A, and so on.
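A minimal CPU reference for what I expect (a sketch, assuming the squared Euclidean distance, i.e. no sqrt, which is also what the kernel below accumulates; the function name is just a placeholder):

// CPU reference (sketch): C[i][j] = sum over k of (A[i][k] - B[j][k])^2
void euclidianDistancesCPU(const float *A, const float *B, float *C, int n, int m)
{
	for (int i = 0; i < n; ++i)
		for (int j = 0; j < m; ++j)
		{
			float acc = 0.0f;
			for (int k = 0; k < 128; ++k)
			{
				float d = A[i * 128 + k] - B[j * 128 + k];
				acc += d * d;
			}
			C[i * m + j] = acc;
		}
}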

I’ve implemented it this way:
I’ve got a grid of (n * m) blocks, with 128 threads per block (1 * 128).
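The launch looks something like this (a sketch; the device pointer names are placeholders):

// One block per (row of A, row of B) pair; 128 threads per block,
// one thread per vector component.
dim3 threadsPerBlock(1, SIZE);   // SIZE == 128
dim3 numBlocks(n, m);
EuclidianDistances<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n, m);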

The program compiles, but the problem is that it doesn’t give correct distances.
I can’t figure out what’s wrong…

__global__ void EuclidianDistances( float *A, float *B , float *C , int n , int m)
{
        // SIZE is equal to 128
	__shared__ float accumResult[SIZE];
	__shared__ float sA[SIZE];
	__shared__ float sB[SIZE];

        // MAPPING
	int bx = blockIdx.x;  // n
	int by = blockIdx.y;  // m
	int ty = threadIdx.y; // 128
	int tx = threadIdx.x; // 1

	sA[ty] = A [bx * SIZE + ty];
	sB[ty] = B [by * SIZE + ty];
	__syncthreads();

	accumResult[ty] = (sA[ty] - sB[ty])*(sA[ty] - sB[ty]);
	__syncthreads();

	// Parallel tree-reduction
	for (int stride = SIZE/2 ; stride < 0 ; stride >>= 1)
		if (ty < stride)
			accumResult[ty]	+= accumResult [stride + ty];
	__syncthreads();

        // Writing results to output matrix
	if ((threadIdx.y == 0))
		C [bx * m + by] = accumResult[ty];
	__syncthreads();
}

Just a few ideas:

  • the reduction probably isn’t working: stride will never be < 0, so the for loop never executes. You also need the __syncthreads() inside the loop; every element of accumResult has to be updated before the next iteration can start (see the sketch after this list)
  • you can make sA and sB regular variables, they don’t have to be shared memory arrays for your implementation
  • the last __syncthreads() at the very end of the kernel can be removed
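Something along these lines for the reduction (a sketch, untested):

// Tree reduction: halve the active stride each step; the barrier sits inside
// the loop so every partial sum is written before the next step reads it.
for (int stride = SIZE / 2; stride > 0; stride >>= 1)
{
	if (ty < stride)
		accumResult[ty] += accumResult[ty + stride];
	__syncthreads();
}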

Thanks for the reply. Actually, I declared sA, sB, and accumResult as shared memory arrays to accelerate kernel execution (10 times faster). Shared memory allows the 128 threads of a block to access the data very quickly.

You’re right, the __syncthreads() should be inside the for loop.
I changed the for-loop stop condition to stride < 1, and removed the useless final __syncthreads().

The program is still giving me the same bad distance results.

I just solved my problem: I replaced (stride < 0) with (stride > 0). The glitch was obvious and stupid.

Good to hear things are working now!

But I still think sA and sB don’t have to be in shared memory. Every element, indexed by ty, is only used by one thread: the thread that writes the element at ty is also the one that reads it, and no other thread touches it. I would change the two loads of sA and sB to:

float sA = A [bx * SIZE + ty];
float sB = B [by * SIZE + ty];

I guess this will even be a little bit faster, as you save some writes and reads to shared memory

When I remove sA and sB from shared memory, the results are no longer the same; is that normal? Moreover, it’s a bit slower. I’ve just removed the __shared__ qualifier from all three declarations.

You should remove __shared__ only from sA and sB, and keep it on accumResult. The three declarations should be something like:

__shared__ float accumResult[SIZE];
float sA;
float sB;

I got good results, but the execution time is twice as high as before…
Here are my declarations:

__shared__ float accumResult[SIZE];
float sA[SIZE];
float sB[SIZE];

sA and sB should not be arrays, just regular variables, as in my previous post. In the rest of the code you should replace “sA[ty]” with “sA” and “sB[ty]” with “sB”.
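With those changes, the per-thread part of the kernel would read roughly like this (a sketch):

// sA and sB live in registers; only accumResult stays in shared memory,
// because the reduction needs threads to read each other's partial sums.
float sA = A[bx * SIZE + ty];
float sB = B[by * SIZE + ty];
accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();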

Indeed, it’s a bit faster. 50 ms gained! Thank you :)

I also heard that some people declare variables in registers rather than in global and/or shared memory. Would that be worth it in this case?

Final code here : https://github.com/djebm2/AACV/blob/master/gpu/cuda/kernels/bruteforce.cu

Please, do not hesitate to improve it.