speeding up k-means on the gpu

greetings,

i implemented a simple k-means on the gpu last week and am now wondering how i can speed the whole thing up further. i improved the performance iteratively, starting with the most naive implementation giving < 1 gflops up to what i have at the moment, which yields ~20 gflops on an 8800 gts. i would very much appreciate any hints that might speed up my implementation.

what i basically do: given n d-dimensional vectors and k d-dimensional vectors ( centroids ) i calculate the distance between every vector and every centroid, yielding n*k distance calculations. each vector is assigned to the closest centroid. this is reflected by an extra dimension for each vector that holds the id of the centroid it is closest to.

the kernel looks like this:

//
// gpu kernel for labeling datapoints. uses shared memory
//
// data       ... device pointer to data points
// points     ... number of points
// centroids  ... device pointer to centroids
// clusters   ... number of clusters
// dimensions ... number of elements per vector ( excluding clusterid )
// moved      ... device pointer to a flag, set to 1 if any point changed its cluster
// pitch      ... pitch in floats of data array
//
__global__ void kmeans_kernel_3( float* data, int points,
                                 float* centroids, int clusters,
                                 int dimensions,
                                 int* moved,
                                 int pitch )
{
    extern __shared__ float means[];
    float* vector = &data[blockDim.x * blockIdx.x + threadIdx.x];
    int iterations = points / ( blockDim.x * gridDim.x ) + 1;
    while( iterations-- )
    {
        float mindist = 3.402823466e+38F;
        int index = 0;
        for( int j = 0; j < clusters; j++ )
        {
            // thread 0 stages the current centroid in shared memory for the whole block
            if( threadIdx.x == 0 )
                loadVector( means, &centroids[j*(dimensions+1)], dimensions + 1 );
            __syncthreads();

            // guard against threads whose point index lies beyond the last point
            if( !(vector - data > points - 1) )
            {
                float dist = distance_gpu_transpose( means, vector, dimensions, pitch );
                if( dist < mindist )
                {
                    mindist = dist;
                    index = j;
                }
            }
            __syncthreads();
        }
        if( !(vector - data > points - 1) )
        {
            // element 0 of each vector stores the id of its closest centroid
            if( vector[0] != index )
            {
                vector[0] = index;
                *moved = 1;
            }
        }
        vector += (gridDim.x * blockDim.x);
    }
}

//
// loads a centroid from global memory to shared memory
//
__device__ void loadVector( float *target, float* source, int dimensions )
{
    for( int i = 0; i < dimensions; i++ )
        target[i] = source[i];
}

//
// calculates the euclidean distance between two vectors, the second is assumed to be
// transposed
//
// v1         ... device pointer to vector one
// v2         ... device pointer to vector two
// dimensions ... number of elements per vector ( excluding clusterid )
// return: distance between the two points
//
__device__ float distance_gpu_transpose( float *v1, float *v2, int dimensions, int pitch )
{
    float dist = 0;
    for( int i = 1; i < dimensions + 1; i++ )
    {
        float tmp = v2[i*pitch] - v1[i];
        dist += tmp * tmp;
    }
    return sqrt(dist);
}
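for completeness, this is roughly how the kernel gets launched on the host side ( a simplified sketch, the block / grid sizes and variable names are placeholders rather than my exact setup code ):

// simplified host-side launch ( block / grid sizes are placeholders )
int threads = 256;
int blocks  = 64;
// the dynamic shared memory has to hold one centroid including its clusterid slot
size_t shared = ( dimensions + 1 ) * sizeof( float );
kmeans_kernel_3<<< blocks, threads, shared >>>( d_data, points, d_centroids,
                                                clusters, dimensions, d_moved, pitch );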

the input vectors are laid out in a transposed manner, that is

x0 x1 x2…

y0 y1 y2…

in order to allow for coalescing. i use a linear grid / block layout. each block calculates the minimum distance of #threads_per_block successive vectors to the centroids ( means in the above code ). since each thread computes the distance from its current vector to the same centroid that the other threads are using at the same time, i load the current centroid into shared memory.
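to make that layout concrete, the host side sets the array up roughly like this ( a sketch with made-up sizes, my real code obviously gets the data from elsewhere ):

// transposed ( structure-of-arrays ) layout, sizes made up for illustration:
// row 0 holds the clusterid of every point, row i holds element i-1 of every point.
// element i of point p sits at data[ i * pitch + p ], so consecutive threads
// ( consecutive p ) touch consecutive addresses -> coalesced loads
int    points     = 65536;
int    dimensions = 16;
int    pitch      = points;   // pitch in floats, padded in practice to keep rows aligned
size_t bytes      = ( dimensions + 1 ) * pitch * sizeof( float );
float* h_data = (float*) malloc( bytes );
float* d_data = 0;
cudaMalloc( (void**)&d_data, bytes );
cudaMemcpy( d_data, h_data, bytes, cudaMemcpyHostToDevice );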

as i said, on my card ( 8800 gts ) i get around 20 gflops. per kernel invocation i count n * k * ( 1 + (3 * d + 1) ) floating point operations: 1 for the comparison between the calculated distance and the current minimum distance, and 3 * d + 1 for the distance calculation between a vector and a centroid.
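in code, the number i report is computed roughly like this ( timing via cuda events is just how i'd sketch it here, not necessarily what my benchmark code does ):

// turning the operation count n * k * ( 1 + (3 * d + 1) ) into gflop/s ( sketch )
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );
cudaEventRecord( start, 0 );
kmeans_kernel_3<<< blocks, threads, shared >>>( d_data, points, d_centroids,
                                                clusters, dimensions, d_moved, pitch );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
float ms = 0.0f;
cudaEventElapsedTime( &ms, start, stop );
double flops  = (double)points * clusters * ( 1.0 + ( 3.0 * dimensions + 1.0 ) );
double gflops = flops / ( ms * 1.0e6 );   // flops per nanosecond = gflop/s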

i went from 0.6gflops ( naive implementation ) to 4gflops ( shared memory caching of centroids ) to 20gflops ( global memory coalescing ). any ideas how to further improve this? ( not that 20gflops is all that bad :) )

thanks in advance and excuse my lengthy post,

mario

It looks like dimensions is a small, fixed number. Instead of making it a normal parameter, you could make it a template parameter to kmeans_kernel_3, loadVector, and distance_gpu_transpose. This will allow the compiler to unroll your short loops, which might speed things up further. (Sometimes unrolling loops can backfire by increasing register usage, so keep your old code handy in case you need to revert it.)
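Roughly like this (a sketch, untested against your code; I've only shown the distance function, the kernel gets the same treatment):

// Templated distance function: with DIM known at compile time the compiler can
// fully unroll the loop. (Sketch, untested.)
template< int DIM >
__device__ float distance_gpu_transpose( float *v1, float *v2, int pitch )
{
    float dist = 0;
    for( int i = 1; i < DIM + 1; i++ )
    {
        float tmp = v2[i*pitch] - v1[i];
        dist += tmp * tmp;
    }
    return sqrt(dist);
}

// The kernel becomes template< int DIM > __global__ void kmeans_kernel_3( ... ) with
// `dimensions` replaced by DIM everywhere, and is launched with an explicit argument:
// kmeans_kernel_3<16><<< blocks, threads, shared >>>( d_data, points, d_centroids,
//                                                     clusters, d_moved, pitch );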

nice, i wasn’t aware of function templates as the programming guide didn’t talk about those in the case of kernels ( afair :) ). i just added this and it increased performance by ~13%. of course, instantiating the template at compile time is somewhat limiting as the dimensions and centroids ( i templated both parameters ) will vary from dataset to dataset. nevertheless, a nice speedup. thanks for the tip.

@registers: i noted a decrease in register use from 16 to 14. i guess that’s due to the elimination of the two for loops in the kernel and the respective counter variables. i wonder, is there a code size limit? that is, what would happen if parameter dimensions were set to, say, 54 and parameter clusters to 100?

edit: another question: i guess 20 gflops is really the upper bound for such an algorithm on the gpu, as i don't have enough calculations per memory access. i’d be happy if the experts around here could share their experience with non-computation-intensive algorithms. i just want to make sure that i cannot get faster without sacrificing the generality of this implementation ( currently i’m only bound by the maximum number of dimensions, due to the fact that i hold centroids in shared memory ).

If you know the range of possible values, you can use a switch statement to instantiate the template for the different possibilities. It adds some bloat to your cubin, which will now have N kernels, but each will be optimized for the dimensionality of the dataset.
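Something like this on the host side (a sketch, assuming the templated kernel from the previous post):

// Dispatch the runtime dimensionality to compile-time instantiations (sketch).
switch( dimensions )
{
    case 2:  kmeans_kernel_3<2> <<< blocks, threads, shared >>>( d_data, points, d_centroids, clusters, d_moved, pitch ); break;
    case 3:  kmeans_kernel_3<3> <<< blocks, threads, shared >>>( d_data, points, d_centroids, clusters, d_moved, pitch ); break;
    case 16: kmeans_kernel_3<16><<< blocks, threads, shared >>>( d_data, points, d_centroids, clusters, d_moved, pitch ); break;
    default: // fall back to the generic, non-templated kernel for anything else
             kmeans_kernel_3<<< blocks, threads, shared >>>( d_data, points, d_centroids, clusters, dimensions, d_moved, pitch ); break;
}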

The code size limit is pretty large. I seem to recall it being on the order of megabytes, but you’ll have to search the forum to find the exact number. You should also check out Mark Harris’s optimization talk from SC2007 (hah, second time in a week I’ve plugged this presentation), which gave me some great ideas for improving my code:

http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf

thanks again for the hints. i already thought about instantiating multiple template versions via a switch. the problem is that the domain is way too big in our use case to be covered by a switch. i could imagine runtime compilation, if that is possible. i’m still not decided on whether it’s worth all this for an additional 12% boost.

i read harris’ presentation a couple of weeks ago in preparation for my first contact with CUDA. i might revisit it; i think he described loop unrolling too. thanks for this hint.

as for my guess on the maximum bound: i just augmented the distance_gpu_transpose method with some additional operations on register values, double checking that nvcc isn’t eliminating them. after adding 36 operations i now get 162 gflops. so i guess i will investigate kernel k-means now :) ( where kernel is a mercer kernel, not a cuda kernel ). those should more than make up for the memory latency.

sorry for my long posts, i just believe that they might help others in their quest for high performance computing.

and thanks again for your reply.

You mention the number of GFLOP/s you get. But what about GiB/s in memory bandwidth? Your kernel will certainly be limited by that and not by floating point operations. 70 GiB/s is achievable on a GTX.

i’m aware that my code is bandwidth bound rather than flop bound. however, i’m not entirely sure how to measure the actual bandwidth on the gpu. also, how could i increase bandwidth on the gpu apart from coalescing and reducing bank conflicts? i think i got those two right. i have not yet found out how to get information from the profiler on bank conflicts / non-coalesced memory accesses.

There is no easy way to count the number of memory operations. You need to go through your code by hand and come up with a formula for the number of memory operations performed, given the input parameters and the grid/block sizes. As long as your memory accesses are deterministic, this can always be done. If the number of accesses depends on the data, or even worse on values calculated from the data, then it gets more complicated.
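For a kernel like yours, a back-of-the-envelope estimate might look like this (a sketch; the exact byte count depends on your grid configuration and on how often the cluster id actually changes, and it ignores the comparatively small centroid loads):

// Rough effective-bandwidth estimate (sketch). Each point reads its d elements once
// per centroid; the clusterid is read once and written at most once per point.
// `ms` is the measured kernel time in milliseconds.
double bytes = (double)points * clusters * dimensions * sizeof(float)
             + (double)points * 2.0 * sizeof(float);
double gibps = bytes / ( ms * 1.0e-3 ) / ( 1024.0 * 1024.0 * 1024.0 );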

You can tell the profiler to count coalesced vs. non-coalesced reads by putting the lines
gld_coherent
gld_incoherent
into a file and setting CUDA_PROFILE_CONFIG to that file name. I’m going from memory here, so I may have the names or underscores off a little bit; check the CUDA_PROFILER documentation in the toolkit doc/ directory.

i guess i was just too lazy to count memory accesses at that point. portions of the code are non-deterministic, but with some straightforward calculations i think i can get a good approximation.

thanks for that advice, i must have overlooked that in the readme. i tested and fixed the code accordingly. i get perfectly coalesced loads and only minor uncoalesced stores with the fixed version. the uncoalesced stores arise from the assignment to moved, which i think i can neglect.

thank you very much for your suggestions.