Optimization question: Euclidean distance between arrays

Hi there,

I got this Problem:

I have an array of x times 64 integer cells. Each 64-wide block can be understood as a 64-dimensional vector.

I have to calculate the Euclidean distance from each of the x vectors to all the others (as the sum over the squared component differences).

My x is about 120,000, and I have a kernel that does the calculation FOR ONE VECTOR and stores all the distances to the other vectors into an array. Per vector it takes about 2 ms, which I think is suboptimal.

I have no further ideas on where I could tune my kernel, so I'm posting it here. Perhaps someone has a suggestion or two.

Every hint is welcome!

Here is the kernel, with a few words of context:

I have a 2-dimensional grid and a 2-dimensional block. The grid dimensions don't matter here, but blockDim.x = 64 and blockDim.y = 8, so there are 512 threads per block, and the 64 matches the length of each vector.

__global__ void square_sum(int *a, int *b, int candidate, int N)
{
	extern __shared__ int s_data[];    // one int per thread (blockDim.x * blockDim.y)
	__shared__ int s_candidate[64];    // the candidate vector, shared by the whole block

	unsigned int blockId  = gridDim.y*blockIdx.x + blockIdx.y;
	unsigned int threadId = threadIdx.y*blockDim.x + threadIdx.x;
	unsigned int idx      = blockId*blockDim.x*blockDim.y + threadId;

	if (idx < N){
		// the first row of threads loads the candidate vector into shared memory
		if (threadIdx.y==0)
			s_candidate[threadIdx.x] = a[64*candidate + threadIdx.x];
		__syncthreads();

		// squared component difference for this thread's element
		int diff = a[idx] - s_candidate[threadIdx.x];
		//s_data[threadId]=(a[idx]-s_candidate[threadIdx.x])*(a[idx]-s_candidate[threadIdx.x]);
		s_data[threadId] = diff*diff;
		__syncthreads();
	}

	// reduction over the 64 components of each vector (one vector per threadIdx.y row)
	//int s=32;
	if (threadIdx.x<32)
		s_data[threadId] = s_data[threadId] + s_data[threadId+32];
	//__syncthreads();
	//s=16;
	if (threadIdx.x<16)
		s_data[threadId] = s_data[threadId] + s_data[threadId+16];
	//__syncthreads();
	//s=8;
	if (threadIdx.x<8)
		s_data[threadId] = s_data[threadId] + s_data[threadId+8];
	//__syncthreads();
	//s=4;
	if (threadIdx.x<4)
		s_data[threadId] = s_data[threadId] + s_data[threadId+4];
	//__syncthreads();
	//s=2;
	if (threadIdx.x<2)
		s_data[threadId] = s_data[threadId] + s_data[threadId+2];
	//__syncthreads();
	//s=1;
	if (threadIdx.x<1)
		s_data[threadId] = s_data[threadId] + s_data[threadId+1];

	// thread 0 of each row writes the squared distance for its vector
	if (threadIdx.x == 0) b[blockId*blockDim.y + threadIdx.y] = s_data[blockDim.x*threadIdx.y];
}
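For reference, a minimal sketch of the host-side launch with this configuration (the variable names numVectors, d_a, d_b and the exact grid split here are only illustrative):

// Hypothetical host-side launch configuration, not the original code.
void launch_square_sum(int *d_a, int *d_b, int numVectors, int candidate)
{
    const int VEC_LEN = 64;                      // blockDim.x: one thread per vector component
    const int VECS_PER_BLOCK = 8;                // blockDim.y: vectors handled per block
    dim3 block(VEC_LEN, VECS_PER_BLOCK);         // 512 threads per block

    int N = numVectors * VEC_LEN;                // total number of ints in a
    int numBlocks = (numVectors + VECS_PER_BLOCK - 1) / VECS_PER_BLOCK;
    dim3 grid(numBlocks, 1);                     // any 2-D split works with the blockId formula

    size_t smem = block.x * block.y * sizeof(int);   // dynamic shared memory for s_data

    square_sum<<<grid, block, smem>>>(d_a, d_b, candidate, N);
}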

What interests me is:

  • Faster data transfer to and from global memory (it takes about 1.8 ms, ~30 GB/s, on a GTX 295).

  • Solving possible problems with shared memory access.

  • Optimizing the sum algorithm if possible.

  • Or is this already optimal?

Thanks in advance

In the meantime I replaced all multiplications with __umul24() or __mul24(), which gained me about 0.3 ms :)
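Concretely, the index arithmetic now looks roughly like this (a sketch; the 24-bit multiply is safe here because all factors stay far below 2^24):

	// index computations with 24-bit multiplies
	unsigned int blockId  = __umul24(gridDim.y, blockIdx.x) + blockIdx.y;
	unsigned int threadId = __umul24(threadIdx.y, blockDim.x) + threadIdx.x;
	unsigned int idx      = __umul24(blockId, blockDim.x*blockDim.y) + threadId;
	// and for the candidate offset:
	s_candidate[threadIdx.x] = a[__umul24(64, candidate) + threadIdx.x];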

Now I have an idea: the candidate vector, to which I want to calculate the distances, lies among all the other vectors.
It has to be read from global into shared memory in every single block (there are 14,881 blocks), which makes an extra 3.6 MB of transfer from global to shared memory.
Is there another memory option where I could put this vector, so that every thread in every block can read it quickly?
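For example, would constant memory be an option? A minimal sketch of what I have in mind (the symbol names are made up); I'm not sure it actually helps, since the constant cache only broadcasts for free when all threads of a warp read the same address, which is not the case here:

// Hypothetical sketch: keep the candidate vector in constant memory instead of
// re-loading it into shared memory in every block.
__constant__ int c_candidate[64];

// Host side, once per candidate, before the kernel launch
// (h_a being the host copy of a):
//   cudaMemcpyToSymbol(c_candidate, h_a + 64 * candidate, 64 * sizeof(int));
// or copying device-to-device from d_a:
//   cudaMemcpyToSymbol(c_candidate, d_a + 64 * candidate, 64 * sizeof(int), 0,
//                      cudaMemcpyDeviceToDevice);

// In the kernel, the s_candidate load and its __syncthreads() would go away:
//   int diff = a[idx] - c_candidate[threadIdx.x];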

A warp is the basic execution unit; it is synchronized implicitly, so you don't need the divergent paths below threadIdx.x < 32.

Your thread block is 64 x 8, so the threads with 0 <= threadIdx.x < 32 in each row form one warp.

Try modifying the code from

	//int s=32;
	if (threadIdx.x<32)
		s_data[threadId]=s_data[threadId]+s_data[threadId+32];
	//__syncthreads();
	//s=16;
	if (threadIdx.x<16)
		s_data[threadId]=s_data[threadId]+s_data[threadId+16];
	//__syncthreads();
	//s=8;
	if (threadIdx.x<8)
		s_data[threadId]=s_data[threadId]+s_data[threadId+8];
	//__syncthreads();
	//s=4;
	if (threadIdx.x<4)
		s_data[threadId]=s_data[threadId]+s_data[threadId+4];
	//__syncthreads();
	//s=2;
	if (threadIdx.x<2)
		s_data[threadId]=s_data[threadId]+s_data[threadId+2];
	//__syncthreads();
	//s=1;
	if (threadIdx.x<1)
		s_data[threadId]=s_data[threadId]+s_data[threadId+1];

to

	//int s=32;
	if (threadIdx.x<32){
		s_data[threadId]=s_data[threadId]+s_data[threadId+32];
		s_data[threadId]=s_data[threadId]+s_data[threadId+16];
		s_data[threadId]=s_data[threadId]+s_data[threadId+8];
		s_data[threadId]=s_data[threadId]+s_data[threadId+4];
		s_data[threadId]=s_data[threadId]+s_data[threadId+2];
		s_data[threadId]=s_data[threadId]+s_data[threadId+1];
	}

You save the divergent-branch cost and the __syncthreads() cost.
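One caveat: warp-synchronous code like this should read and write shared memory through a volatile pointer, otherwise the compiler may keep the partial sums in registers and the later steps see stale values. A sketch of the unrolled tail with that qualifier:

	if (threadIdx.x < 32){
		// volatile forces each step to go through shared memory
		volatile int *v = s_data;
		v[threadId] += v[threadId + 32];
		v[threadId] += v[threadId + 16];
		v[threadId] += v[threadId +  8];
		v[threadId] += v[threadId +  4];
		v[threadId] += v[threadId +  2];
		v[threadId] += v[threadId +  1];
	}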

Hi LSChien,

That was a good hint! My latest time per run was

1.76 ms

and the new one is

1.36 ms

That's nice! Thank you very much.

Another question that arose:

Is there an explanation for why this problem cannot be accelerated by CUDA as well as other problems can?

Julian

Edit: I gained another 0.15 ms per run by changing the block size from 512 threads to 256 threads.

really?

512 threads per block : 1.36 ms

256 threads per block : 0.15 ms

This is strange.

Since your array is one-dimensional, I suggest that you

change

unsigned int blockId = gridDim.y*blockIdx.x+blockIdx.y;

to

unsigned int blockId = gridDim.x*blockIdx.y + blockIdx.x;

so that adjacent thread blocks access contiguous memory blocks.

Your problem is memory-intensive, so performance is limited by bandwidth.

The bandwidth of a GTX 295 is 112 GB/s (using only one of its two GPUs).

In your problem, you need to load 120,000 vectors of 64 integers from global memory into shared memory.

Performance is bounded by “load data into shared memory”:

120,000 x 64 x 4 bytes / (112 GB/s) ≈ 0.274 ms
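Spelled out as a small check with the same figures:

	double bytes     = 120000.0 * 64 * 4;   // vectors x components x sizeof(int)
	double bandwidth = 112e9;               // GTX 295, one GPU, in bytes/second
	double seconds   = bytes / bandwidth;   // = 0.000274 s, i.e. about 0.274 ms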

However, this number is larger than your reported time for block size 256.

Can you describe how you measure the execution time of your kernel?

That was not clear enough: I meant that I reduced the time per run by 0.15 ms, so now I'm at 1.21 ms/run.

EDIT: Measurement:

I use the cutil timer. I start it before the kernel invocation; after the kernel invocation I do a cudaThreadSynchronize(), then I stop the timer and print its value.
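For comparison, the equivalent measurement with CUDA events (instead of the cutil timer) would look roughly like this; just a sketch, with grid, block, smem and the device pointers taken from the launch above:

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	square_sum<<<grid, block, smem>>>(d_a, d_b, candidate, N);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);                 // wait for the kernel to finish

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
	printf("kernel time: %.3f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);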