Anyone have a simple k-means implementation for CUDA?

I'm wondering whether anyone has a simple implementation of the k-means algorithm in CUDA. I found a fairly optimized version by Chi Kit LAM/Qiong Luo, but I'd prefer something much simpler if anyone has one. Thanks!
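For comparison, a minimal sketch of plain Lloyd's k-means in CUDA might look like the following. It is not the Chi Kit LAM/Qiong Luo code. It assumes row-major `float` data, caller-supplied initial centroids, and a fixed iteration count. `kmeans_cuda`, `assign_and_accumulate`, and `update_centroids` are hypothetical names. One thread per point finds the nearest centroid and adds the point into per-cluster sums with `atomicAdd`. A second kernel then divides each sum by its cluster count.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cfloat>
#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
      exit(1);                                                       \
    }                                                                \
  } while (0)

// Assignment step: one thread per point picks the nearest centroid by
// squared Euclidean distance, then adds the point into that cluster's sum.
__global__ void assign_and_accumulate(const float* points, const float* centroids,
                                      int n, int dim, int k, int* labels,
                                      float* sums, int* counts) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int best = 0;
  float best_dist = FLT_MAX;
  for (int c = 0; c < k; ++c) {
    float d = 0.0f;
    for (int j = 0; j < dim; ++j) {
      float diff = points[i * dim + j] - centroids[c * dim + j];
      d += diff * diff;
    }
    if (d < best_dist) { best_dist = d; best = c; }
  }
  labels[i] = best;
  for (int j = 0; j < dim; ++j) atomicAdd(&sums[best * dim + j], points[i * dim + j]);
  atomicAdd(&counts[best], 1);
}

// Update step: one thread per centroid coordinate. Empty clusters keep
// their previous centroid.
__global__ void update_centroids(float* centroids, const float* sums,
                                 const int* counts, int k, int dim) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t >= k * dim) return;
  int c = t / dim;
  if (counts[c] > 0) centroids[t] = sums[t] / counts[c];
}

// Hypothetical host driver. h_centroids holds the k initial centroids on
// entry and the final centroids on exit. h_labels receives the assignment
// computed in the last iteration (i.e. the one used for the final update).
void kmeans_cuda(const float* h_points, int n, int dim, int k, int iters,
                 float* h_centroids, int* h_labels) {
  assert(n > 0 && dim > 0 && k > 0);
  float *d_points, *d_centroids, *d_sums;
  int *d_labels, *d_counts;
  CUDA_CHECK(cudaMalloc(&d_points, sizeof(float) * n * dim));
  CUDA_CHECK(cudaMalloc(&d_centroids, sizeof(float) * k * dim));
  CUDA_CHECK(cudaMalloc(&d_sums, sizeof(float) * k * dim));
  CUDA_CHECK(cudaMalloc(&d_labels, sizeof(int) * n));
  CUDA_CHECK(cudaMalloc(&d_counts, sizeof(int) * k));
  CUDA_CHECK(cudaMemcpy(d_points, h_points, sizeof(float) * n * dim, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemcpy(d_centroids, h_centroids, sizeof(float) * k * dim, cudaMemcpyHostToDevice));

  const int threads = 256;
  for (int it = 0; it < iters; ++it) {
    // Reset the per-iteration accumulators (all-zero bytes == 0.0f).
    CUDA_CHECK(cudaMemset(d_sums, 0, sizeof(float) * k * dim));
    CUDA_CHECK(cudaMemset(d_counts, 0, sizeof(int) * k));
    assign_and_accumulate<<<(n + threads - 1) / threads, threads>>>(
        d_points, d_centroids, n, dim, k, d_labels, d_sums, d_counts);
    update_centroids<<<(k * dim + threads - 1) / threads, threads>>>(
        d_centroids, d_sums, d_counts, k, dim);
    CUDA_CHECK(cudaGetLastError());
  }
  // cudaMemcpy waits for the preceding kernels on the default stream.
  CUDA_CHECK(cudaMemcpy(h_centroids, d_centroids, sizeof(float) * k * dim, cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaMemcpy(h_labels, d_labels, sizeof(int) * n, cudaMemcpyDeviceToHost));
  cudaFree(d_points); cudaFree(d_centroids); cudaFree(d_sums);
  cudaFree(d_labels); cudaFree(d_counts);
}
```

This trades speed for simplicity. With few clusters, many threads contend on the same `atomicAdd` targets, and the floating-point summation order is not deterministic. A shared-memory reduction per block (as optimized versions tend to use) would scale better.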