greetings,

i have been implementing a simple k-means on the gpu over the last week and am now wondering how i can speed the whole thing up. i improved the performance iteratively, starting with the most naive implementation giving < 1 gflops up to what i have at the moment, which yields ~20 gflops on an 8800 gts. i would very much appreciate any hints that might speed up my implementation.

what i basically do: given n d-dimensional vectors and k d-dimensional vectors ( centroids ), i calculate the distance between each vector and each centroid, yielding n*k distance calculations. each vector is assigned to the closest centroid. this is reflected by an extra dimension per vector that holds the id of the centroid it is closest to.
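for reference, this is roughly what the labeling step looks like on the cpu — a plain c sketch i'd use to check the gpu results against ( function and variable names are made up, and it uses a straightforward row-major layout, unlike the transposed layout on the gpu ):

```c
#include <float.h>

/* cpu reference for the labeling step: assigns each of n d-dimensional
   points to the nearest of k centroids. points and centroids are
   row-major here. assign[i] holds the current cluster id of point i.
   returns 1 if any assignment changed, 0 otherwise.
   note: comparing squared distances is enough to find the argmin,
   so no sqrt is needed in this reference. */
static int label_points_cpu(const float *points, int n,
                            const float *centroids, int k,
                            int d, int *assign)
{
    int moved = 0;
    for (int i = 0; i < n; i++) {
        float mindist = FLT_MAX;
        int index = 0;
        for (int j = 0; j < k; j++) {
            float dist = 0.0f;
            for (int e = 0; e < d; e++) {
                float tmp = points[i * d + e] - centroids[j * d + e];
                dist += tmp * tmp;
            }
            if (dist < mindist) {
                mindist = dist;
                index = j;
            }
        }
        if (assign[i] != index) {
            assign[i] = index;
            moved = 1;
        }
    }
    return moved;
}
```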

the kernel looks like this:

```
//
// gpu kernel for labeling data points. uses shared memory for the
// current centroid.
//
// data       ... device pointer to data points ( transposed layout )
// points     ... number of points
// centroids  ... device pointer to centroids
// clusters   ... number of clusters
// dimensions ... number of elements per vector ( excluding clusterid )
// moved      ... device pointer to a flag, set to 1 if any point changed cluster
// pitch      ... pitch in floats of data array
//
__device__ void loadVector( float *target, float *source, int dimensions );
__device__ float distance_gpu_transpose( float *v1, float *v2, int dimensions, int pitch );

__global__ void kmeans_kernel_3( float *data, int points,
                                 float *centroids, int clusters,
                                 int dimensions,
                                 int *moved,
                                 int pitch )
{
    extern __shared__ float means[];
    float *vector = &data[blockDim.x * blockIdx.x + threadIdx.x];
    int iterations = points / ( blockDim.x * gridDim.x ) + 1;
    while( iterations-- )
    {
        float mindist = 3.402823466e+38F; // FLT_MAX
        int index = 0;
        for( int j = 0; j < clusters; j++ )
        {
            if( threadIdx.x == 0 )
                loadVector( means, &centroids[j * (dimensions + 1)], dimensions + 1 );
            __syncthreads();
            if( !(vector - data > points - 1) )
            {
                float dist = distance_gpu_transpose( means, vector, dimensions, pitch );
                if( dist < mindist )
                {
                    mindist = dist;
                    index = j;
                }
            }
            __syncthreads();
        }
        if( !(vector - data > points - 1) )
        {
            if( vector[0] != index )
            {
                vector[0] = index;
                *moved = 1;
            }
        }
        vector += gridDim.x * blockDim.x;
    }
}

//
// loads a centroid from global memory to shared memory
//
__device__ void loadVector( float *target, float *source, int dimensions )
{
    for( int i = 0; i < dimensions; i++ )
        target[i] = source[i];
}

//
// calculates the euclidean distance between two vectors, the second is
// assumed to be transposed
//
// v1         ... device pointer to vector one ( the centroid in shared memory )
// v2         ... device pointer to vector two ( a transposed data point )
// dimensions ... number of elements per vector ( excluding clusterid )
// return: distance between the two vectors
//
__device__ float distance_gpu_transpose( float *v1, float *v2, int dimensions, int pitch )
{
    float dist = 0;
    for( int i = 1; i < dimensions + 1; i++ )
    {
        float tmp = v2[i * pitch] - v1[i];
        dist += tmp * tmp;
    }
    return sqrt( dist );
}
```

the input vectors are laid out in a transposed manner, that is

x0 x1 x2…

y0 y1 y2…

…

in order to allow for coalescing. i use a linear grid / block layout. each block computes the minimum distance of #threads_per_block successive vectors to the centroids ( means in the above code ). since each thread computes the distance from its current vector to the same centroid that the other threads are using at the same time, i load the centroids into shared memory.
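in plain c, the indexing difference between the two layouts would look roughly like this ( just an illustration, the helper names are my own; pitch is the padded row length in floats, as in the kernel ):

```c
/* row-major ( "array of vectors" ): element e of point i sits next to
   the other elements of the same point. */
static float elem_row_major(const float *data, int i, int e, int d)
{
    return data[i * d + e];
}

/* transposed ( "vector of arrays" ): all points' x-coordinates come
   first, then all y-coordinates, etc. consecutive threads reading
   element e of consecutive points i then touch consecutive addresses,
   which is what allows the hardware to coalesce the loads. */
static float elem_transposed(const float *data, int i, int e, int pitch)
{
    return data[e * pitch + i];
}
```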

as i said, on my card ( 8800 gts ) i get around 20 gflops. per kernel invocation i count n*k*( 1 + (3*d + 1) ) floating point operations: 1 for the comparison between the calculated distance and the current minimum distance, and 3*d + 1 for the distance calculation between a vector and a centroid ( d subtractions, d multiplications, d additions and one sqrt ).
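as a small helper, that flop count and the resulting gflops figure can be computed like this ( function names are made up, the formula is exactly the one counted above ):

```c
/* flops per kernel invocation as counted above: n * k * (1 + (3*d + 1)).
   the 1 is the min comparison, 3*d covers the subtract / multiply / add
   per element, and the final 1 is the sqrt. */
static double kmeans_flops(long n, long k, long d)
{
    return (double)n * (double)k * (1.0 + (3.0 * (double)d + 1.0));
}

/* achieved gflops given total flops and elapsed wall time in seconds */
static double achieved_gflops(double flops, double seconds)
{
    return flops / seconds / 1e9;
}
```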

i went from 0.6gflops ( naive implementation ) to 4gflops ( shared memory caching of centroids ) to 20gflops ( global memory coalescing ). any ideas how to further improve this? ( not that 20gflops is all that bad :) )

thanks in advance and excuse my lengthy post,

mario