Hi there,
I got this Problem:
I have an array of x times 64 integers cells. Each 64wide block can be understood as a 64Dimensional Vector.
I have to calculate the euclidean distance of all x vectors to each others (sum over all squared distances).
My x is about 120’000 and I have a Kernel, that does the calculation FOR ONE VECTOR and stores all the distances to the other vectors into an array. Per vector it takes about 2ms which I think of beeing suboptimal.
I have no further ideas where I could tune my kernel so I post it here. Perhaps someone has the one or the other idea.
Every hint is welcome!
here is the Kernel  some words:
I have a 2Dimensional GridDim and a 2Dimensional BlockDim. The grid dimension doesn’t matter here. But BlockDim.x=64 and BlockDim.y=8 so there are 512 threads per Block and the 64 equates the length of each vector.
__global__ void square_sum(int *a, int *b, int candidate, int N)
{
extern __shared__ int s_data[];
__shared__ int s_candidate[64];
unsigned int blockId = gridDim.y*blockIdx.x+blockIdx.y;
unsigned int threadId = threadIdx.y*blockDim.x+threadIdx.x;
unsigned int idx = blockId*blockDim.x*blockDim.y + threadId;
if (idx < N){
if (threadIdx.y==0)
s_candidate[threadIdx.x] = a[64*candidate + threadIdx.x];
__syncthreads();
int diff= a[idx]s_candidate[threadIdx.x];
//s_data[threadId]=(a[idx]s_candidate[threadIdx.x])*(a[idx]s_candidate[threadIdx.x]);
s_data[threadId]=diff*diff;
__syncthreads();
}
//int s=32;
if (threadIdx.x<32)
s_data[threadId]=s_data[threadId]+s_data[threadId+32];
//__syncthreads();
//s=16;
if (threadIdx.x<16)
s_data[threadId]=s_data[threadId]+s_data[threadId+16];
//__syncthreads();
//s=8;
if (threadIdx.x<8)
s_data[threadId]=s_data[threadId]+s_data[threadId+8];
//__syncthreads();
//s=4;
if (threadIdx.x<4)
s_data[threadId]=s_data[threadId]+s_data[threadId+4];
//__syncthreads();
//s=2;
if (threadIdx.x<2)
s_data[threadId]=s_data[threadId]+s_data[threadId+2];
//__syncthreads();
//s=1;
if (threadIdx.x<1)
s_data[threadId]=s_data[threadId]+s_data[threadId+1];
if ((threadIdx.x == 0)) b[blockId*blockDim.y+threadIdx.y] = s_data[blockDim.x*threadIdx.y];
}
Interesting for me is:

Faster data transfer from and to global memory (takes about 1.8ms) ~30GB/s on a GTX295

Solving possible problems with shared memory access.

Optimizing the sum algorithm if possible.

or is this already optimal?
Thanks in advance