Euclidean distance between matrices: what is the quickest way?

Hi people!

I’m writing a program with CUDA and the problem is the following:

-two matrices A (m * 128) and B (n * 128)

-I take the first row of A, and I compute the distance between that vector and all the rows of B, one by one.

  • I write the result of each distance to a row of a matrix C, so the element C(i,j) of C contains the distance between row i of A and row j of B.

-Then I proceed with the next row of A.

With other functions I then find the minimum distance, but the problem is that the kernel described above takes too long.
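In CPU terms, what I want is equivalent to this reference loop (illustrative C++ with my own names, just to pin down the computation):

```cpp
#include <vector>
#include <cstddef>

// Reference (CPU) version: C(i,j) = squared Euclidean distance between
// row i of A (m x k) and row j of B (n x k). Matrices are stored
// row-major in flat vectors. All names here are illustrative.
std::vector<float> distance_matrix(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   std::size_t m, std::size_t n, std::size_t k)
{
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < k; ++c) {
                float d = A[i * k + c] - B[j * k + c];
                acc += d * d;
            }
            C[i * n + j] = acc;
        }
    return C;
}
```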

I’ve implemented it this way:

I’ve got a grid of ( m * (n / 1024) ) blocks

and 1024 threads per block.

(I’ve put in all the necessary bounds guards, don’t worry.)

(the matrices are already in global memory)

and I launch the following kernel:

__global__ void distance_kernel(float* d_C, float* d_A, float* d_B, int m, int n)
{
    __shared__ float common_row[128];

    // all the threads in a block work on the same row of A
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    const int NUM_COL = 128;   // A and B have the same number of columns

    // load the row of A into shared memory
    if (tx < NUM_COL)
        common_row[tx] = d_A[bx * NUM_COL + tx];
    __syncthreads();

    // each thread computes the distance to one row of B,
    // accumulating directly in global memory (d_C assumed zeroed)
    int row_B = by * 1024 + tx;
    if (row_B < n) {
        for (int i = 0; i < NUM_COL; i++) {
            float diff = common_row[i] - d_B[row_B * NUM_COL + i];
            d_C[bx * n + row_B] += diff * diff;
        }
    }
}

This kernel takes 4 seconds with matrices of size (7000 × 128) and (4000 × 128).

The question is: how can I improve the speed?
distance_kernel.pdf (23.2 KB)

Let’s try the actual code (why didn’t you just post it?)

__global__ void distance_kernel(float* d_C, float* d_A, float* d_B, int m, int n)
{
    __shared__ float main_row[128];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    const int NUM_COL = 128;

    if (tx < NUM_COL)
        main_row[tx] = d_A[bx * NUM_COL + tx];
    __syncthreads();

    if (((by * 1024) + tx) < n) {
        int row_B = (by * 1024) + tx;
        for (int i = 0; i < NUM_COL; i++) {
            float diff = main_row[i] - d_B[row_B * NUM_COL + i];
            d_C[bx * n + row_B] += diff * diff;
        }
    }
}

So you have (NUM_COL+2) global memory stores and NUM_COL uncoalesced global memory loads per distance calculation. That is why the code is slow.

Yeah, this is one issue; a “C_shared” would do the trick.

Another important point: if you aren’t really looking for the Euclidean distance itself, but just for its minimum, why compute it fully? Take the square root out. It takes about 8 times longer to compute than an add/multiply, and since the square root doesn’t change the ordering, it isn’t necessary.
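To see why: sqrt is monotonically increasing, so the row of B with the smallest squared distance is also the one with the smallest true distance. A quick host-side check (names mine):

```cpp
#include <algorithm>
#include <vector>
#include <cstddef>

// Index of the minimum element. Because sqrt is monotonic, calling this on
// squared distances gives the same index as calling it on true distances.
std::size_t argmin(const std::vector<float>& v)
{
    return std::min_element(v.begin(), v.end()) - v.begin();
}
```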

Thanks for the reply. So for C_shared I have to create a new shared array in the kernel, __shared__ float C[col_number_of_d_C],

and write the results of my Euclidean distance there?

And for coalesced access I just have to modify the access to the vector, is that right?

And beyond these two very important changes, how can I further improve the execution speed?

Use a register variable instead. There is no need to use shared memory for the accumulation; it is just a waste of resources and will be slower.

Yes. Load d_B into shared memory in a similar way to what is already being done for d_A, then have the distance calculation load only from shared memory.

OK, but the problem is writing back to d_C; it’s still expensive, isn’t it?

My goal is to have in row a of C all the Euclidean distances between row a of A and all the rows of B.

Then I have to get the minimum distance for each row, but there are lots of good algorithms to do that.

Any other suggestions apart from those above?

Thank you so much!

Yes, but it is inevitable. By doing as I suggest you are reducing the number of writes per distance from 130 to 1. That will yield very large performance improvements. You can coalesce the writes as well, but that is probably not going to give as big a benefit as fixing some of the wasteful global memory accesses the original code contains.
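The pattern, sketched on the host (an illustration of the idea, not your actual kernel):

```cpp
#include <cstddef>

// One distance = one final store: accumulate the sum of squared differences
// in a local variable (which the CUDA compiler keeps in a register in the
// GPU version) and let the caller write to C exactly once, instead of
// read-modify-writing C on every loop iteration.
float squared_distance(const float* row_a, const float* row_b, std::size_t k)
{
    float acc = 0.0f;            // a register in the GPU version
    for (std::size_t i = 0; i < k; ++i) {
        float d = row_a[i] - row_b[i];
        acc += d * d;
    }
    return acc;                  // caller performs the single store into C
}
```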

I’ll use a register for the sum then, and I’ll load B into shared memory.

Tomorrow I’ll be back in the lab, and I’ll try this.

Do you have any other ideas to improve speed?

I’ve seen directives like #pragma unroll, and other compiler directives; do you think some of these may be useful?

Thanks again for the reply.

I can tell you that you should be able to do it in less than 0.4 s. See the technique in this paper about speeding up Vector Quantization (to be published in February 2011). Internally, it uses the Euclidean distance. The distance can be rewritten to use matrix multiplication, and then you can use cuBLAS and get a big gain. I hope to post a stand-alone example of this on my website soon.

Disclaimer: I am the author of the paper
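The core trick is the standard identity ||a − b||² = ||a||² − 2·a·b + ||b||²: the cross terms for all row pairs form the matrix product A·Bᵀ, which cuBLAS computes very efficiently, while the squared norms are computed once per row. A small host-side check of the identity (illustrative code, names mine):

```cpp
#include <cstddef>

// Plain dot product of two length-k vectors.
float dot(const float* x, const float* y, std::size_t k)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < k; ++i) s += x[i] * y[i];
    return s;
}

// Direct squared Euclidean distance.
float sq_dist_direct(const float* a, const float* b, std::size_t k)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < k; ++i) {
        float d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Same value via the gemm-friendly decomposition: the norms can be
// precomputed per row, and the dot products for all (i,j) pairs come
// from one matrix multiply (A * B^T).
float sq_dist_via_gemm(const float* a, const float* b, std::size_t k)
{
    return dot(a, a, k) - 2.0f * dot(a, b, k) + dot(b, b, k);
}
```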

That would be awesome, thank you!

I’m reading your paper, and I think I’ve improved my code a lot.

But I’m sure that I can do better than this.

I’ll post the new code this afternoon.

And another stupid question:

if I have a matrix in global memory and I have to put it in shared memory, how can I obtain coalescing?

And: my numbers in the matrices are between 0 and 255; may I use another number type to improve performance? (I’m using float.)

OK, so I have 2 matrices A (m × 128) and B (n × 128)

blocks = [m × n]

threads = [1 × 128]

and a matrix C for temporary results.

#define ACCUM_N 128

__global__ void distance_kernel2(float *d_C, float *d_A, float *d_B,
		int row_A, int row_B) {

	__shared__ float accumResult[ACCUM_N];
	__shared__ float sA[ACCUM_N];
	__shared__ float sB[ACCUM_N];

	int bx = blockIdx.x;   // row of A
	int by = blockIdx.y;   // row of B
	int ty = threadIdx.y;  // 0..127, one thread per column

	// load one row of A and one row of B into shared memory (non coalesced?)
	sA[ty] = d_A[bx * ACCUM_N + ty];
	sB[ty] = d_B[by * ACCUM_N + ty];
	__syncthreads();

	// squared difference for this column
	accumResult[ty] = (sA[ty] - sB[ty]) * (sA[ty] - sB[ty]);
	__syncthreads();

	// perform tree-like reduction of the accumulators' results
	for (int stride = ACCUM_N / 2; stride > 0; stride >>= 1) {
		if (ty < stride)
			accumResult[ty] += accumResult[stride + ty];
		__syncthreads();
	}

	if (threadIdx.y == 0)
		d_C[(bx * row_B) + by] = accumResult[0];
}
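For reference, the tree-like reduction behaves like this host-side sketch (illustrative only, made-up names):

```cpp
#include <vector>
#include <cstddef>

// Host-side model of the shared-memory tree reduction: the active range is
// halved each step, so after log2(n) steps accum[0] holds the sum of all
// n inputs (n must be a power of two, e.g. ACCUM_N = 128).
float tree_reduce(std::vector<float> accum)
{
    for (std::size_t stride = accum.size() / 2; stride > 0; stride >>= 1)
        for (std::size_t ty = 0; ty < stride; ++ty)   // "if (ty < stride)" on the GPU
            accum[ty] += accum[stride + ty];
    return accum[0];
}
```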



This is far faster than the previous one, but how can I do better?

Thanks to everybody, I’m looking forward to your next replies!

May I use short int, or some smaller float type, to improve performance? (Values are between 0 and 255.)

For example, suppose the tile is 4 × 4 (here ACCUM_N is assumed to be 128, and righe_B is the number of rows of B):

#define ACCUM_N 128

__global__ void distance_kernel_tile(float *d_C, float *d_A, float *d_B,
		int righe_B)
{
    float a[4];
    float b[4];

    int bx = 4 * blockIdx.x;
    int by = 4 * blockIdx.y;
    int tx = threadIdx.x;

    __shared__ float accumResult[ACCUM_N];
    __shared__ float c_smem[4][4];

    // each thread loads column tx of 4 rows of A and 4 rows of B
#pragma unroll
    for (int j = 0; j < 4; j++) {
        a[j] = d_A[(bx + j) * 128 + tx];
        b[j] = d_B[(by + j) * ACCUM_N + tx];
    }

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float c = (a[i] - b[j]);
            accumResult[tx] = c * c;
            __syncthreads();

            // tree reduction over the 128 partial results
            for (int stride = ACCUM_N / 2; stride > 0; stride >>= 1) {
                if (tx < stride)
                    accumResult[tx] += accumResult[stride + tx];
                __syncthreads();
            }

            if (0 == tx)
                c_smem[i][j] = accumResult[0];
            __syncthreads();
        }
    }

    // write c_smem to C: the first 16 threads, one per tile element
    int gid  = tx >> 2;        // tx / 4
    int lane = tx & (4 - 1);   // tx % 4
    if (tx < 16)
        d_C[(bx + gid) * righe_B + (by + lane)] = c_smem[gid][lane];
}

configuration is

dim3 threads(128, 1);
dim3 grids((m + 3) / 4, (n + 3) / 4);

You can refer to Volkov’s GTC 2010 talk:

2238, “Better Performance at Lower Occupancy”

Thank you! Your code gained me something like 15% in performance!

Is there still a way to make it better?

@Domoken: I’ve read your paper; do you think that with that formulation of the Euclidean distance I can do better than this?

Sorry if I’m asking you all these questions, but I can only try CUDA programming for a few hours in the morning.

I think that you should ask another question:

what peak performance do you expect, and what is the performance now?

  1. A 4×4 tile is not the best.

  2. Your application is the same as BLAS NT (A is non-transposed but B is transposed).

     You should try a typical gemm implementation without the tree reduction, because the reduction is expensive.

OK, so:

  1. would a 16×16 tile be more efficient?
  2. BLAS NT: what does NT stand for? I didn’t find it in the CUBLAS library.
  3. could you suggest a method to avoid the tree reduction?

16 in the major direction of the storage gives you coalesced reads if you structure the code correctly.

NT = A in normal order, B in transposed order. There are effectively 4 gemm variants: NN, NT, TN, TT

Just use the standard naive dot product implementation you would see in any textbook.
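For example, each output element can be produced by one plain serial loop in the “NT” access pattern (a host-side sketch with my own names; on the GPU one thread would own each C element, with no shared-memory reduction at all):

```cpp
#include <vector>
#include <cstddef>

// "NT" pattern: C(i,j) = sum_c A(i,c) * B(j,c) -- B is indexed as if
// transposed. Each output element comes from one serial accumulation,
// which is the structure a gemm-style kernel uses instead of a
// tree reduction across threads.
std::vector<float> gemm_nt(const std::vector<float>& A,
                           const std::vector<float>& B,
                           std::size_t m, std::size_t n, std::size_t k)
{
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;                       // per-output accumulator
            for (std::size_t c = 0; c < k; ++c)
                acc += A[i * k + c] * B[j * k + c];
            C[i * n + j] = acc;
        }
    return C;
}
```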

thank you again!

Could you explain the semantics of these lines of code?

thank you!

int gid = tx >> 2 ;

int lane = tx & (4-1);

tx = 4 * gid + lane, 0 <= lane < 4
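In other words, these are just power-of-two division and remainder done with bit operations. A small host-side check (illustrative names):

```cpp
// tx >> 2 is tx / 4, and tx & (4 - 1) is tx % 4; both shortcuts are valid
// only because 4 is a power of two. Together they split a linear thread
// index into a (row, column) position inside the 4x4 tile.
struct TileIndex { int gid; int lane; };

TileIndex split(int tx)
{
    TileIndex t;
    t.gid  = tx >> 2;       // tx / 4 : which tile row
    t.lane = tx & (4 - 1);  // tx % 4 : which tile column
    return t;
}
```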

And what does the operator “>>=” mean?