Optimization question: Euclidean distance between arrays

Hi there,

I got this Problem:

I have an array of x times 64 integer cells. Each 64-wide block can be understood as a 64-dimensional vector.

I have to calculate the Euclidean distance from each of the x vectors to all the others (as the sum over the squared component differences).

My x is about 120,000, and I have a kernel that does the calculation FOR ONE VECTOR and stores all the distances to the other vectors into an array. Per vector it takes about 2 ms, which I think is suboptimal.

I have no further ideas on where I could tune my kernel, so I'm posting it here. Perhaps someone has a suggestion or two.

Every hint is welcome!

Here is the kernel, with a few words of context:

I have a 2-dimensional grid and a 2-dimensional block. The grid dimensions don't matter here, but blockDim.x = 64 and blockDim.y = 8, so there are 512 threads per block, and the 64 matches the length of each vector.

__global__ void square_sum(int *a, int *b, int candidate, int N)
{
	extern __shared__ int s_data[];    // one int per thread (blockDim.x * blockDim.y)
	__shared__ int s_candidate[64];    // the candidate vector, shared by the whole block

	unsigned int blockId  = gridDim.y*blockIdx.x + blockIdx.y;
	unsigned int threadId = threadIdx.y*blockDim.x + threadIdx.x;
	unsigned int idx      = blockId*blockDim.x*blockDim.y + threadId;

	if (idx < N){
		// the first row of threads loads the candidate vector into shared memory
		if (threadIdx.y==0)
			s_candidate[threadIdx.x] = a[64*candidate + threadIdx.x];
		__syncthreads();

		// squared component difference for this thread's element
		int diff = a[idx] - s_candidate[threadIdx.x];
		//s_data[threadId]=(a[idx]-s_candidate[threadIdx.x])*(a[idx]-s_candidate[threadIdx.x]);
		s_data[threadId] = diff*diff;
		__syncthreads();
	}

	// reduction over the 64 components of each vector (one vector per threadIdx.y row)
	//int s=32;
	if (threadIdx.x<32)
		s_data[threadId] = s_data[threadId] + s_data[threadId+32];
	//__syncthreads();
	//s=16;
	if (threadIdx.x<16)
		s_data[threadId] = s_data[threadId] + s_data[threadId+16];
	//__syncthreads();
	//s=8;
	if (threadIdx.x<8)
		s_data[threadId] = s_data[threadId] + s_data[threadId+8];
	//__syncthreads();
	//s=4;
	if (threadIdx.x<4)
		s_data[threadId] = s_data[threadId] + s_data[threadId+4];
	//__syncthreads();
	//s=2;
	if (threadIdx.x<2)
		s_data[threadId] = s_data[threadId] + s_data[threadId+2];
	//__syncthreads();
	//s=1;
	if (threadIdx.x<1)
		s_data[threadId] = s_data[threadId] + s_data[threadId+1];

	// thread 0 of each row writes the squared distance for its vector
	if (threadIdx.x == 0) b[blockId*blockDim.y + threadIdx.y] = s_data[blockDim.x*threadIdx.y];
}
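For reference, a minimal sketch of the host-side launch with this configuration (the variable names numVectors, d_a, d_b and the exact grid split here are only illustrative):

// Hypothetical host-side launch configuration, not the original code.
void launch_square_sum(int *d_a, int *d_b, int numVectors, int candidate)
{
    const int VEC_LEN = 64;                      // blockDim.x: one thread per vector component
    const int VECS_PER_BLOCK = 8;                // blockDim.y: vectors handled per block
    dim3 block(VEC_LEN, VECS_PER_BLOCK);         // 512 threads per block

    int N = numVectors * VEC_LEN;                // total number of ints in a
    int numBlocks = (numVectors + VECS_PER_BLOCK - 1) / VECS_PER_BLOCK;
    dim3 grid(numBlocks, 1);                     // any 2-D split works with the blockId formula

    size_t smem = block.x * block.y * sizeof(int);   // dynamic shared memory for s_data

    square_sum<<<grid, block, smem>>>(d_a, d_b, candidate, N);
}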

What interests me is:

  • Faster data transfer to and from global memory (it takes about 1.8 ms, ~30 GB/s, on a GTX 295).

  • Solving possible problems with shared memory access.

  • Optimizing the sum algorithm if possible.

  • Or is this already optimal?

Thanks in advance

In the meantime I replaced all multiplications with __umul24() or __mul24(), which gained me about 0.3 ms :)
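Concretely, the index arithmetic now looks roughly like this (a sketch; the 24-bit multiply is safe here because all factors stay far below 2^24):

	// index computations with 24-bit multiplies
	unsigned int blockId  = __umul24(gridDim.y, blockIdx.x) + blockIdx.y;
	unsigned int threadId = __umul24(threadIdx.y, blockDim.x) + threadIdx.x;
	unsigned int idx      = __umul24(blockId, blockDim.x*blockDim.y) + threadId;
	// and for the candidate offset:
	s_candidate[threadIdx.x] = a[__umul24(64, candidate) + threadIdx.x];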

Now I have an idea: the candidate vector, to which I want to calculate the distances, lies among all the other vectors.
It has to be read from global into shared memory in every single block (there are 14,881 blocks), which makes an extra 3.6 MB of transfer from global to shared memory.
Is there another memory option where I could put this vector, so that every thread in every block can read it quickly?
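For example, would constant memory be an option? A minimal sketch of what I have in mind (the symbol names are made up); I'm not sure it actually helps, since the constant cache only broadcasts for free when all threads of a warp read the same address, which is not the case here:

// Hypothetical sketch: keep the candidate vector in constant memory instead of
// re-loading it into shared memory in every block.
__constant__ int c_candidate[64];

// Host side, once per candidate, before the kernel launch
// (h_a being the host copy of a):
//   cudaMemcpyToSymbol(c_candidate, h_a + 64 * candidate, 64 * sizeof(int));
// or copying device-to-device from d_a:
//   cudaMemcpyToSymbol(c_candidate, d_a + 64 * candidate, 64 * sizeof(int), 0,
//                      cudaMemcpyDeviceToDevice);

// In the kernel, the s_candidate load and its __syncthreads() would go away:
//   int diff = a[idx] - c_candidate[threadIdx.x];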

A warp is the basic execution unit; it is synchronized implicitly, so you don't need the divergent paths below threadIdx.x < 32.

Your thread block is 64 x 8, so the threads with 0 <= threadIdx.x < 32 in each row form one warp.

Try modifying the code from

	//int s=32;
	if (threadIdx.x<32)
		s_data[threadId]=s_data[threadId]+s_data[threadId+32];
	//__syncthreads();
	//s=16;
	if (threadIdx.x<16)
		s_data[threadId]=s_data[threadId]+s_data[threadId+16];
	//__syncthreads();
	//s=8;
	if (threadIdx.x<8)
		s_data[threadId]=s_data[threadId]+s_data[threadId+8];
	//__syncthreads();
	//s=4;
	if (threadIdx.x<4)
		s_data[threadId]=s_data[threadId]+s_data[threadId+4];
	//__syncthreads();
	//s=2;
	if (threadIdx.x<2)
		s_data[threadId]=s_data[threadId]+s_data[threadId+2];
	//__syncthreads();
	//s=1;
	if (threadIdx.x<1)
		s_data[threadId]=s_data[threadId]+s_data[threadId+1];

to

	//int s=32;
	if (threadIdx.x<32){
		s_data[threadId]=s_data[threadId]+s_data[threadId+32];
		s_data[threadId]=s_data[threadId]+s_data[threadId+16];
		s_data[threadId]=s_data[threadId]+s_data[threadId+8];
		s_data[threadId]=s_data[threadId]+s_data[threadId+4];
		s_data[threadId]=s_data[threadId]+s_data[threadId+2];
		s_data[threadId]=s_data[threadId]+s_data[threadId+1];
	}

You save the divergent-branch cost and the __syncthreads() cost.
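One caveat: warp-synchronous code like this should read and write shared memory through a volatile pointer, otherwise the compiler may keep the partial sums in registers and the later steps see stale values. A sketch of the unrolled tail with that qualifier:

	if (threadIdx.x < 32){
		// volatile forces each step to go through shared memory
		volatile int *v = s_data;
		v[threadId] += v[threadId + 32];
		v[threadId] += v[threadId + 16];
		v[threadId] += v[threadId +  8];
		v[threadId] += v[threadId +  4];
		v[threadId] += v[threadId +  2];
		v[threadId] += v[threadId +  1];
	}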

Hi LSChien,

That was a good hint! My latest time per run was

1.76 ms

and the new one is

1.36 ms

That's nice! Thank you very much.

Another question that arose:

Is there an explanation for why this problem cannot be accelerated by CUDA as well as other problems can?

Julian

Edit: I gained another 0.15 ms per run by changing the block size from 512 threads to 256 threads.

really?

512 threads per block : 1.36 ms

256 threads per block : 0.15 ms

This is strange.

Since your array is one-dimensional, I suggest that you

change

unsigned int blockId = gridDim.y*blockIdx.x+blockIdx.y;

to

unsigned int blockId = gridDim.x*blockIdx.y + blockIdx.x;

so that adjacent thread blocks access contiguous memory blocks.

Your problem is memory-intensive, so performance is limited by bandwidth.

The bandwidth of a GTX 295 is 112 GB/s (using only one of its two GPUs).

In your problem, you need to load 120,000 vectors of 64 integers from global memory into shared memory.

Performance is bounded by “load data into shared memory”:

120,000 x 64 x 4 bytes / (112 GB/s) ≈ 0.274 ms
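Spelled out as a small check with the same figures:

	double bytes     = 120000.0 * 64 * 4;   // vectors x components x sizeof(int)
	double bandwidth = 112e9;               // GTX 295, one GPU, in bytes/second
	double seconds   = bytes / bandwidth;   // = 0.000274 s, i.e. about 0.274 ms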

However, this number is larger than your reported time for block size 256.

Can you describe how you measure the execution time of your kernel?

That was not clear enough: I meant that I reduced the time per run by 0.15 ms, so now I'm at 1.21 ms/run.

EDIT: Measurement:

I use the cutil timer. I start it before the kernel invocation; after the kernel invocation I do a cudaThreadSynchronize(), then I stop the timer and print its value.
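For comparison, the equivalent measurement with CUDA events (instead of the cutil timer) would look roughly like this; just a sketch, with grid, block, smem and the device pointers taken from the launch above:

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	square_sum<<<grid, block, smem>>>(d_a, d_b, candidate, N);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);                 // wait for the kernel to finish

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
	printf("kernel time: %.3f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);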