I have an array of x times 64 integers cells. Each 64-wide block can be understood as a 64-Dimensional Vector.
I have to calculate the euclidean distance of all x vectors to each others (sum over all squared distances).
My x is about 120’000 and I have a Kernel, that does the calculation FOR ONE VECTOR and stores all the distances to the other vectors into an array. Per vector it takes about 2ms which I think of beeing suboptimal.
I have no further ideas where I could tune my kernel so I post it here. Perhaps someone has the one or the other idea.
Every hint is welcome!
here is the Kernel - some words:
I have a 2-Dimensional GridDim and a 2-Dimensional BlockDim. The grid dimension doesn’t matter here. But BlockDim.x=64 and BlockDim.y=8 so there are 512 threads per Block and the 64 equates the length of each vector.
__global__ void square_sum(int *a, int *b, int candidate, int N)
{
extern __shared__ int s_data[];
__shared__ int s_candidate[64];
unsigned int blockId = gridDim.y*blockIdx.x+blockIdx.y;
unsigned int threadId = threadIdx.y*blockDim.x+threadIdx.x;
unsigned int idx = blockId*blockDim.x*blockDim.y + threadId;
if (idx < N){
if (threadIdx.y==0)
s_candidate[threadIdx.x] = a[64*candidate + threadIdx.x];
__syncthreads();
int diff= a[idx]-s_candidate[threadIdx.x];
//s_data[threadId]=(a[idx]-s_candidate[threadIdx.x])*(a[idx]-s_candidate[threadIdx.x]);
s_data[threadId]=diff*diff;
__syncthreads();
}
//int s=32;
if (threadIdx.x<32)
s_data[threadId]=s_data[threadId]+s_data[threadId+32];
//__syncthreads();
//s=16;
if (threadIdx.x<16)
s_data[threadId]=s_data[threadId]+s_data[threadId+16];
//__syncthreads();
//s=8;
if (threadIdx.x<8)
s_data[threadId]=s_data[threadId]+s_data[threadId+8];
//__syncthreads();
//s=4;
if (threadIdx.x<4)
s_data[threadId]=s_data[threadId]+s_data[threadId+4];
//__syncthreads();
//s=2;
if (threadIdx.x<2)
s_data[threadId]=s_data[threadId]+s_data[threadId+2];
//__syncthreads();
//s=1;
if (threadIdx.x<1)
s_data[threadId]=s_data[threadId]+s_data[threadId+1];
if ((threadIdx.x == 0)) b[blockId*blockDim.y+threadIdx.y] = s_data[blockDim.x*threadIdx.y];
}
Interesting for me is:
Faster data transfer from and to global memory (takes about 1.8ms) ~30GB/s on a GTX295
Solving possible problems with shared memory access.
In the meantime I replaced all multiplications with __umul24() or __mul24() which gave 0.3ms :)
Now I have an Idea: The vector, to which I want to calculate the distances lies in between all the other vectors.
It has to be read from global to shared memory in every single block (there are 14881 blocks) which makes extra 3.6MB of transfer from global to shared memory.
Is there another memory-option where I could put this vector to, so that every thread in every block can quicly read it?
warp is a basic unit, it is __synch implicitly. you don’t need divergent path for threadIdx.x < 32
your thread block is 64 x 8 and then 0 <= threadIdx.x < 32 becomes a warp.
tyr to modify the code
//int s=32;
if (threadIdx.x<32)
s_data[threadId]=s_data[threadId]+s_data[threadId+32];
//__syncthreads();
//s=16;
if (threadIdx.x<16)
s_data[threadId]=s_data[threadId]+s_data[threadId+16];
//__syncthreads();
//s=8;
if (threadIdx.x<8)
s_data[threadId]=s_data[threadId]+s_data[threadId+8];
//__syncthreads();
//s=4;
if (threadIdx.x<4)
s_data[threadId]=s_data[threadId]+s_data[threadId+4];
//__syncthreads();
//s=2;
if (threadIdx.x<2)
s_data[threadId]=s_data[threadId]+s_data[threadId+2];
//__syncthreads();
//s=1;
if (threadIdx.x<1)
s_data[threadId]=s_data[threadId]+s_data[threadId+1];