Euclidean distance between 2 matrices using CUDA

Hello people !

I’m writing a program with CUDA and the problem is the following:

  • Two matrices A (n * 128) and B (m * 128)

  • I take the first row of A, and I compute the distance between that vector and all the rows of B, one by one.

  • I write the result of each distance into a row of a matrix C, so element C(i,j) contains the distance between row i of A and row j of B.

  • Then I proceed with the next row of A, and so on.
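A minimal CPU reference for what I expect (a sketch, assuming the squared Euclidean distance, i.e. no sqrt, which is also what the kernel below accumulates; the function name is just a placeholder):

// CPU reference (sketch): C[i][j] = sum over k of (A[i][k] - B[j][k])^2
void euclidianDistancesCPU(const float *A, const float *B, float *C, int n, int m)
{
	for (int i = 0; i < n; ++i)
		for (int j = 0; j < m; ++j)
		{
			float acc = 0.0f;
			for (int k = 0; k < 128; ++k)
			{
				float d = A[i * 128 + k] - B[j * 128 + k];
				acc += d * d;
			}
			C[i * m + j] = acc;
		}
}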

I’ve implemented it this way:
I’ve got a grid of (n * m) blocks, with 128 threads per block (1 * 128).
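The launch looks something like this (a sketch; the device pointer names are placeholders):

// One block per (row of A, row of B) pair; 128 threads per block,
// one thread per vector component.
dim3 threadsPerBlock(1, SIZE);   // SIZE == 128
dim3 numBlocks(n, m);
EuclidianDistances<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n, m);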

The program compiles, but the problem is that it doesn’t give correct distances.
I can’t figure out what’s wrong…

__global__ void EuclidianDistances( float *A, float *B , float *C , int n , int m)
{
        // SIZE is equal to 128
	__shared__ float accumResult[SIZE];
	__shared__ float sA[SIZE];
	__shared__ float sB[SIZE];

        // MAPPING
	int bx = blockIdx.x;  // n
	int by = blockIdx.y;  // m
	int ty = threadIdx.y; // 128
	int tx = threadIdx.x; // 1

	sA[ty] = A [bx * SIZE + ty];
	sB[ty] = B [by * SIZE + ty];
	__syncthreads();

	accumResult[ty] = (sA[ty] - sB[ty])*(sA[ty] - sB[ty]);
	__syncthreads();

	// Parallel tree-reduction
	for (int stride = SIZE/2 ; stride < 0 ; stride >>= 1)
		if (ty < stride)
			accumResult[ty]	+= accumResult [stride + ty];
	__syncthreads();

        // Writing results to output matrix
	if ((threadIdx.y == 0))
		C [bx * m + by] = accumResult[ty];
	__syncthreads();
}

Just a few ideas:

  • the reduction probably isn’t working: stride will never be < 0, so the for loop never executes. You also need the __syncthreads() inside the loop; every element of accumResult has to be updated before the next iteration can start (see the sketch after this list)
  • you can make sA and sB regular variables, they don’t have to be shared memory arrays for your implementation
  • the last __syncthreads() at the very end of the kernel can be removed
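Something along these lines for the reduction (a sketch, untested):

// Tree reduction: halve the active stride each step; the barrier sits inside
// the loop so every partial sum is written before the next step reads it.
for (int stride = SIZE / 2; stride > 0; stride >>= 1)
{
	if (ty < stride)
		accumResult[ty] += accumResult[ty + stride];
	__syncthreads();
}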

Thanks for the reply. Actually, I declared sA, sB, and accumResult as shared memory arrays to accelerate kernel execution (10 times faster). Shared memory allows the 128 threads of a block to access the data very quickly.

You’re right, the __syncthreads() should be inside the for loop.
I changed the for-loop stop condition to stride < 1, and removed the useless final __syncthreads().

The program is still giving me the same bad distance results.

I just solved my problem: I replaced (stride < 0) with (stride > 0). The glitch was obvious and stupid.

Good to hear things are working now!

But I still think sA and sB don’t have to be in shared memory. Every element, indexed by ty, is only used by one thread: the thread that writes the element at ty is also the one that reads it, and no other thread touches it. I would change the two loads of sA and sB to:

float sA = A [bx * SIZE + ty];
float sB = B [by * SIZE + ty];

I guess this will even be a little bit faster, as you save some writes and reads to shared memory

When I remove sA and sB from shared memory, the results are no longer the same; is that normal? Moreover, it’s a bit slower. I’ve just removed the __shared__ qualifier from all three declarations.

You should remove __shared__ only from sA and sB, and keep it on accumResult. The three declarations should be something like:

__shared__ float accumResult[SIZE];
float sA;
float sB;

I got good results, but the execution time is twice as high as before…
Here are my declarations:

__shared__ float accumResult[SIZE];
float sA[SIZE];
float sB[SIZE];

sA and sB should not be arrays, just regular variables, as in my previous post. In the rest of the code you should replace “sA[ty]” with “sA” and “sB[ty]” with “sB”.
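With those changes, the per-thread part of the kernel would read roughly like this (a sketch):

// sA and sB live in registers; only accumResult stays in shared memory,
// because the reduction needs threads to read each other's partial sums.
float sA = A[bx * SIZE + ty];
float sB = B[by * SIZE + ty];
accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();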

Indeed, it’s a bit faster. 50 ms gained! Thank you :)

I also heard that some people declare variables in registers rather than in global and/or shared memory. Would that be worth it in this case?

Final code here : https://github.com/djebm2/AACV/blob/master/gpu/cuda/kernels/bruteforce.cu

Please, do not hesitate to improve it.