Euclidian Distance between matrices how to find the quikiest way?

biofag · February 5, 2011, 9:52pm

Hi people!

I’m writing a program with CUDA and the problem is the following:

-two matrices A (m * 128) and B (n * 128)

-i take the first row of A, and i compute the distance between that vector and all the rows of B, one by one.

i write the result of each distance on a row of a matrix C, so the element C(i,j) of C contains the distance between row i of A and row j of B.

-and i proceed with the next row of A.

with other functions then i find the minimum distance, but the problem is that the kernel above is too long.

I’ve implemented it this way=

I’ve got a grid made by ( m * (n / 1024) ) blocks

and 1024 threads per block.

( i’ve put every type of guards don’t worry)

(the matrices are already in global memory)

and i launch the following kernel:

__global__ void distance_kernel(float* d_C, float* d_A, float* d_B, int m, in n){

__shared__ float common_row[128];

//all the threads in a block have the same row of A

int bx=blockIdx.x;

int by=blockIdx.y;

int tx= threadIdx.x;

int NUM_COL=128;

//A and B have the same number of col.

if(tx < NUM_COL){

  common_row[tx] = d_A[bx*NUM_COL+tx];

}

//load row of A in shared memory

__syncthreads();

//for loop for the distances : pseudocode

for(int i = 0; i< NUM_COL; i++){

d_C+ = ( common_row[i]-b[element])^2;

}

d_C=sqrt(d_C);

__syncthreads();

}

this kernel takes 4 seconds with matrices (7000128) and (4000128)

the question is how can i improve speed??
distance_kernel.pdf (23.2 KB)

avidday · February 6, 2011, 5:15am

Let’s try the actual code (why didn’t you just post it?)

__global__ void distance_kernel(float* d_C, float* d_A, float* d_B, int m, int n)

{

     __shared__ float main_row[128];

int bx = blockIdx.x;

    int by = blockIdx.y;

    int tx= threadIdx.x;

    int NUM_COL=128;

if(tx<NUM_COL){

        main_row[tx] = d_A[bx*NUM_COL+tx];

    }

    __syncthreads();

if (((by*1024)+tx)<n) {

        d_C[bx*n+(by*1024)+tx]=0;

        for(int i=0;i<NUM_COL;i++){

            d_C[bx*n+(by*1024)+tx]+=(main_row[i]-d_B[((by*1024)+tx)*NUM_COL+i])*(main_row[i]-d_B[((by*1024)+tx)*NUM_COL+i]); 

        }

        d_C[bx*n+(by*1024)+tx]=sqrt(d_C[bx*n+(by*1024)+tx]);

    }

}

So you have (NUM_COL+2) global memory stores and NUM_COL uncoalesced global memory loads per distance calculation. That is why the code is slow.

devkec · February 6, 2011, 7:59am

yeah this is one issue, a “C_shared” would do the trick

another important fact is, if you aren’t really looking for the euclidian distance, but just for minimun of it…why compute it totally? take the square root out. it takes 8 times longer to compute than a add/multiply and since square root doesnt change the order, it is not necessary

biofag · February 6, 2011, 1:50pm

thanks for the reply, so for C_shared i have to create a new shared array in the kernel shared float C[col_number_of_d_C]

and write there the results of my euclidian distance?

and for coalescence access i have just to modify the access to the vector is that right?

and beyond this two very important changes how can i improve further the execution speed?

avidday · February 6, 2011, 2:44pm

Use a register variable instead. There is no need to use shared memory for the accumulation, it is just a waste of resources and will be slower.

Yes. Load d_B into shared memory in a similar way to what is already being done for d_A, then have the distance calculation load only from shared memory.

biofag · February 6, 2011, 3:48pm

ok but the problem is writing back to d_C, it’s still expensive isn’t it?

My goal is to have in row a of C all the euclidian distances between row a of A and all the rows in B.

then i have to get the minimum distance for each row, but there are lots of good algorithm to do so.

any other suggestion apart from those before?

thank you so much!

avidday · February 6, 2011, 4:40pm

Yes, but it is inevitable. By doing as I suggest you are reducing the number of writes per distance from 130 to 1. That will yield very large performance improvements. You can coalesce the writes as well, but that is probably not going to give as bigger benefit as fixing some of the wasteful global memory accesses the original code contains.

biofag · February 6, 2011, 7:25pm

i’ll use a register for the sum then, and i’ll load B to shared memory.

tomorrow i’ll be back in lab, and i’ll try this.

Do you have any other ideas to improve speed?

I’ve seen some directive like #pragma unroll, or directives to the compiler do you think that some of this may be useful?

thanks again for reply

Domokoen · February 6, 2011, 7:39pm

I can tell you, you should be able to do it in less than 0.4s. See the technique in this paper about speeding up Vector Quantization (to be published February 2011). Internally, it uses the Euclidean distance. The distance can be rewritten to use matrix multiplication, and then you can use cuBLAS and get a big gain. I hope to post a stand-alone example of this on my website soon.

Disclaimer: I am the author of the paper

biofag · February 7, 2011, 12:15pm

that would be awesome thank you

I’m reading your paper, and i think i’ve improved a lot my code.

But i’m sure that i can do better than this.

i’ll post the new code this afternoon.

biofag · February 7, 2011, 12:29pm

and another stupid question:

if i have a matrix in global memory and i have to put in shared memory how can i obtain coalescence?

and: my numbers in matrices are between 0 and 255, may i use another type of number to improve performance? (i’m using float)

biofag · February 7, 2011, 1:27pm

ok so i have 2 matrices A[m128] and B[n128]

blocks= [m*n]

threads = [1*128]

matrix C for temp results.

__global__ void distance_kernel2(float *d_C, float *d_A, float *d_B,

		int row_A, int row_B) {

	int elementN = 128;

	__shared__ float accumResult[ACCUM_N];

	__shared__ float sA[ACCUM_N];

	__shared__ float sB[ACCUM_N];

	int bx=blockIdx.x;

	int by= blockIdx.y;

	int tx=threadIdx.x;

	int ty=threadIdx.y;

	float sum = 0;

	//non coalesced?

	sA[ty]=d_A[bx*128+ty];

	sB[ty]=d_B[(by)*ACCUM_N+ty];

__syncthreads();

accumResult[ty] = (sA[ty] - sB[ty])*(sA[ty] - sB[ty]);

	__syncthreads();

		for (int stride = ACCUM_N / 2; stride > 0; stride >>= 1) {

			if (ty < stride)

				accumResult[ty]	+= accumResult[stride+ ty];

		}

		// Perform tree-like reduction of accumulators' results.

		__syncthreads();

		if ((threadIdx.y == 0))

			d_C[(bx * righe_B) + (by)] = accumResult[ty];

		__syncthreads();

	}

this is far more faster than the previous one, but how can i do better?

thanks to everibody, I’m looking forward for your next replies!

may i use some short int or short float to improve performance? (values are between 0 and 255)

LSChien · February 7, 2011, 2:13pm

for example, suppose tile is 4 x 4

float a[4] ;

float b[4] ;

int bx = 4*blockIdx.x;        

int by = 4*blockIdx.y;        

int tx = threadIdx.x;        

#pragma unroll

for(int j = 0 ; j < 4 ; j++){

    a[j] = d_A[(bx+j)*128+tx];        

    b[j] = d_B[(by+j)*ACCUM_N+tx]; 

}

__shared__ float accumResult[ACCUM_N]; 

__shared__ float c_smem[4][4];

for(int i = 0 ; i < 4 ; i++){

    for(int j = 0 ; j < 4; j++){

        float c = (a[i] - b[j]) ; 

        accumResult[tx] = c*c ;

        __syncthreads();

        tree-reduction

        __syncthreads(); 

        if ( 0 == tx ){

           c_smem[i][j] = accumResult[0];  

        }

    }

}

// write c_smem to C

int gid = tx >> 2 ;

int lane = tx & (4-1);

if ( tx < 16 ){

    d_C[(bx+gid) * righe_B + (by+lane)] = c_smem[gid][lane] ;

}

configuration is

dim3 threads(128,1) ;

dim3 grids((m+3)/4,(n+3)/4) ;

You can refer Volkov’s talk

2238 Better Performance at Lower Occupancy

http://www.nvidia.com/object/gtc2010-presentation-archive.html

biofag · February 8, 2011, 12:27am

thank you! your code made me gain something like 15% of performance!

Is there still a way to make it better?

@Domoken: i’ve read your paper, do you think that with that type of euclidian distance i can do better than this?

sorry if i’m telling u this questions but i can try by myself CUDA programming only for a few of hours in the morning

LSChien · February 8, 2011, 1:29am

I think that you should ask another question,

what is peak performance you expect and what is the performance now?

tile 4x4 is not the best.
your application is the same as BLAS NT (A is non-transpose but B is transpose).

you should try typical gemm implementation without tree reduction because reduction

is expensive.

biofag · February 8, 2011, 11:18am

ok so

a tile 16*16 would be more efficient?
BLAS NT: what does NT stands for? I didn’t found it in CUBLAS library
could you suggest me a method to avoid tree reduction?

avidday · February 8, 2011, 11:43am

16 in the major direction of the storage gives you coalesced reads if you structure the code correctly.

NT = A in normal order, B in transposed order. There are effectively 4 gemm variants: NN, NT, TN, TT

Just use the standard nÃ¯ave dot product implementation you would see in any textbook.

biofag · February 9, 2011, 12:07am

thank you again!

could you explain me the semantic of this lines of code?

thank you!

int gid = tx >> 2 ;

int lane = tx & (4-1);

LSChien · February 9, 2011, 1:06am

tx = 4 * gid + lane, 0 <= lane < 4

biofag · February 9, 2011, 10:01pm

and what does means this operator “>>=”?

Topic		Replies	Views
Euclidean distance between 2 matrices using CUDA CUDA Programming and Performance	11	6764	August 17, 2015
Simple Kernel, but having problems...ideas CUDA Programming and Performance	16	7923	April 24, 2011
matrix multiply reduction CUDA Programming and Performance	41	35540	January 15, 2011
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3309	January 10, 2010
Matrix Multiplication Garbage value :( CUDA Programming and Performance	10	3404	July 25, 2009
Euclidian distance and synchronization CUDA Programming and Performance	18	20732	July 13, 2007
Optimizing calculation question Euclidean distance between arrays CUDA Programming and Performance	5	5337	March 1, 2010
How would you do this? CUDA Programming and Performance	12	4466	August 5, 2008
Fast CUDA implementation for calculating cross-norm-distance of two matrices CUDA Programming and Performance cuda	0	925	April 8, 2021
Fastest matrix-vector multiplication? CUDA Programming and Performance	24	3883	May 21, 2011

Euclidian Distance between matrices how to find the quikiest way?

Related topics