Pairwise Euclidean distance

Hi everybody.
I am looking for a way to compute pairwise Euclidean distances between two datasets, or within a single dataset.
By this I mean: given two sets of feature vectors (x1, …, xm) and (y1, …, yn), compute the matrix D where Dij = dist(xi, yj).
The application is computing RBF kernels, nearest neighbours, or K-means clustering.

I tried implementing a kernel similar to the matrix-multiplication example in the SDK, which is reasonably fast.
But as Alex Smola posted on his blog, an easy alternative method
is to use the second binomial formula together with a matrix product.
Using the matrix product from CUBLAS, I get an algorithm that is 3 times faster than my kernel.
I was wondering whether there is a way to compute the distances directly,
without doing the matrix multiplication first, that is at least as fast as using the CUBLAS
matrix product.
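For reference, the blog-post trick can be sketched in NumPy like this (function and variable names are my own, and this is a sketch of the idea, not the GPU code):

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distances between columns of X (k x m) and Y (k x n).

    Uses |x - y|^2 = |x|^2 + |y|^2 - 2<x, y>, so the dominant cost
    is a single matrix product, as in the CUBLAS approach.
    """
    x_sq = np.sum(X * X, axis=0)             # |x(i)|^2, shape (m,)
    y_sq = np.sum(Y * Y, axis=0)             # |y(j)|^2, shape (n,)
    cross = X.T @ Y                          # <x(i), y(j)>, shape (m, n)
    return x_sq[:, None] + y_sq[None, :] - 2.0 * cross

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((5, 4))
D = pairwise_sq_dists(X, Y)

# check against the direct definition
D_direct = np.array([[np.sum((X[:, i] - Y[:, j]) ** 2) for j in range(4)]
                     for i in range(3)])
assert np.allclose(D, D_direct)
```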

Any comments would be helpful.


matrix multiplication C := A*B (A is m x s, B is s x n)

for i = 1:m
    for j = 1:n
        creg = 0
        for k = 1:s
            creg += A(i,k) * B(k,j)
        C(i,j) = creg


pairwise squared Euclidean distances Dij = dist(xi,yj)^2

Suppose the columns of X = [x1, x2, …, xm] and Y = [y1, y2, …, yn] are the feature vectors, so X is s x m and Y is s x n.

for i = 1:m
    for j = 1:n
        dreg = 0
        for k = 1:s
            dreg += (X'(i,k) - Y(k,j)) * (X'(i,k) - Y(k,j))
        D(i,j) = dreg

(Take sqrt(dreg) at the end if actual distances are needed; for RBF kernels, nearest neighbours and K-means the squared distances suffice.)


In my opinion, you just replace the FMA "creg += A(i,k)*B(k,j)" by a MUL+FMA pair:

r0 := X'(i,k) - Y(k,j);

dreg = r0*r0 + dreg;
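The substitution can be sketched in plain Python to make the loop structure explicit (illustrative CPU code only, not the GPU kernel; A is m x s, B and Y are s x n, Xt is X transposed, m x s):

```python
def gemm_like(A, B):
    """C(i,j) = sum_k A(i,k) * B(k,j) -- one FMA per inner iteration."""
    m, s, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            creg = 0.0
            for k in range(s):
                creg += A[i][k] * B[k][j]
            C[i][j] = creg
    return C

def pairwise_sq_dist_like(Xt, Y):
    """D(i,j) = sum_k (X'(i,k) - Y(k,j))^2 -- one MUL + one FMA per iteration."""
    m, s, n = len(Xt), len(Xt[0]), len(Y[0])
    D = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            dreg = 0.0
            for k in range(s):
                r0 = Xt[i][k] - Y[k][j]  # extra MUL's operand: r0 replaces A(i,k)*B(k,j)
                dreg = r0 * r0 + dreg    # same FMA as in gemm
            D[i][j] = dreg
    return D
```

Everything outside the inner statement (the i/j/k loop nest, hence the blocking and shared-memory layout on the GPU) stays exactly as in gemm.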

Then you can use the same configuration as gemm to achieve peak performance,

but you need to adopt MAGMA's configuration; it can reach 300 Gflop/s on dgemm.

Don't use the configuration from the matrix multiplication in the SDK, it is too slow.

Thanks for the reply. I'll try out the MAGMA matrix multiplication!

method 1:

Alex Smola uses the formula |x(i) - y(j)|^2 = |x(i)|^2 + |y(j)|^2 - 2<x(i), y(j)>, which reaches optimal complexity.

Suppose X = [x1, x2, …, xm] is a k x m matrix and Y = [y1, y2, …, yn] is a k x n matrix (x(i) and y(j) are k-dimensional vectors).


(1) compute |x(i)|^2 for i = 1, 2, …, m

this is memory-bound: it needs to read m*k floats and write m floats,

so the cost is m(k+1) floats

(2) compute |y(j)|^2 for j = 1, 2, …, n

same as (1), the cost is n(k+1) floats

(3) compute <x(i), y(j)> by sgemm as X'*Y

this is compute-bound; the cost is mnk FMAs.

(4) merge: |x(i) - y(j)|^2 = |x(i)|^2 + |y(j)|^2 - 2<x(i), y(j)>

this is memory-bound; it needs to read mn floats and write mn floats

assume we run on a C2050, which has 120 GB/s bandwidth (ECC on) and 515 giga-instructions per second.

120 GB/s is equivalent to 30 giga-floats per second.

So 1 float of read/write costs the same as 515/30 ≈ 17 instructions.

Total cost of memory access = (2mn + m(k+1) + n(k+1)) floats, or equivalently (2mn + m(k+1) + n(k+1))*17 instructions

If we use the number of instructions for an apples-to-apples comparison, then

Total cost of method 1 = mnk + 34mn instructions (neglecting lower-order terms)
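This instruction-count model can be checked with a small script (the ~17 instructions per float is the C2050 estimate above; exact numbers will differ on other GPUs, and the function names are mine):

```python
INSTR_PER_FLOAT = 515 / 30  # ~17: cost of 1 float of memory traffic in instructions

def cost_method1(m, n, k):
    """gemm (mnk FMA) plus memory traffic for the norms and the merge step."""
    mem_floats = 2 * m * n + m * (k + 1) + n * (k + 1)
    return m * n * k + mem_floats * INSTR_PER_FLOAT

def cost_method2(m, n, k):
    """direct gemm-like kernel: one MUL + one FMA per (i, j, k) triple."""
    return 2 * m * n * k

# for small k the direct kernel wins, for large k the gemm trick wins
print(cost_method2(1000, 1000, 16) < cost_method1(1000, 1000, 16))    # small k
print(cost_method2(1000, 1000, 128) > cost_method1(1000, 1000, 128))  # large k
```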

method 2: adopt MAGMA's configuration to compute |x(i) - y(j)|^2 directly, then

total cost = 2mnk instructions (mnk MUL and mnk FMA)

Hence if k is not large, say k < 50, method 2 may beat method 1 (in this model the break-even point is 2mnk = mnk + 34mn, i.e. k = 34).

In general, method 2 is at most 2x slower than method 1.

You say method 1 is 3x faster than your kernel;

I think this is because you didn't adopt MAGMA's configuration.

Remark: dgemm can achieve 300 Gflop/s, but if k is much smaller than m and n, then dgemm may not reach 300 Gflop/s.

In that case, method 2 can be more efficient than method 1.

What are the typical sizes in your app? (m, n, k) = ?

Thanks for your very detailed reply. This really clears things up.

There are many things I would use this for, but the most interesting one for me at the moment is doing K-means on visual descriptors.

This means k is between 128 (SIFT feature vectors) and maybe 768 (some larger features).

The number of clusters n is typically between 500 and 1000, and the number of examples m is "big", which probably means 10,000, 100,000, or whatever fits into memory.

So I guess I will adopt method 1 in this setup.
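For concreteness, the K-means assignment step I have in mind would look roughly like this in NumPy (just a sketch with my own names; on the GPU the X.T @ C product becomes the sgemm call):

```python
import numpy as np

def assign_clusters(X, C):
    """Assign each column of X (k x m data) to the nearest column of C (k x n centers).

    Uses |x - c|^2 = |x|^2 + |c|^2 - 2<x, c>; |x(i)|^2 is constant along
    each row of the distance matrix, so it can be dropped for the argmin.
    """
    c_sq = np.sum(C * C, axis=0)           # |c(j)|^2, shape (n,)
    cross = X.T @ C                        # <x(i), c(j)>, the sgemm step
    scores = c_sq[None, :] - 2.0 * cross   # |x(i)|^2 omitted: same argmin
    return np.argmin(scores, axis=1)       # shape (m,), cluster index per point
```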

Since this requires the least work on my part, that's not too bad ;)

Thanks again,


Under your (m, n, k), dgemm can reach 240~280 Gflop/s, so method 1 is a good choice, especially since k is bigger than 50.