Removing a for loop inside a kernel function

Hello everyone, I'm coding a CUDA version of 2D k-means and I'm getting a ~21x speedup over the sequential version.
The kernel that assigns points to clusters looks roughly like this:

__global__ void function_that_assign_point_to_cluster(float* punti, float* clusters) {
    // ... do work, one thread = one point ...
    // for each cluster, find the best fit:
    for (int i = 0; i < CLUSTER_NUM; i++) {
        // ... assign point ...
    }
    // ... do work ...
}

Usually CLUSTER_NUM is a small number such as 10, 20 or 50, but is this loop a problem for performance? Is putting a relatively small loop inside each thread really considered bad technique? Can it be done better?
The Visual Profiler reports that this function takes almost 100% of GPU time, with a mean execution time of 22 ms.
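
A minimal, self-contained sketch of the pattern described above (one thread per point, a short loop over the clusters) could look like the following. It is an illustration under assumptions, not the actual kernel: the names assign_points and assign_on_gpu, the interleaved (x, y) layout, the squared Euclidean distance, the block size of 256 and the value CLUSTER_NUM = 3 are all hypothetical choices made for the example.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cfloat>
#include <vector>

// Assumed value for illustration; the post mentions 10, 20 or 50.
#define CLUSTER_NUM 3

// One thread = one point.
// Assumed layout: points and clusters are interleaved (x, y) float pairs.
__global__ void assign_points(const float* points, const float* clusters,
                              int* assignment, int n_points) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_points) return;  // guard the last, partially filled block

    float px = points[2 * idx];
    float py = points[2 * idx + 1];

    float best_dist = FLT_MAX;
    int best = 0;
    // For each cluster, find the best fit (smallest squared distance).
    for (int i = 0; i < CLUSTER_NUM; i++) {
        float dx = px - clusters[2 * i];
        float dy = py - clusters[2 * i + 1];
        float d = dx * dx + dy * dy;
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    assignment[idx] = best;
}

// Hypothetical host helper: copies the inputs to the device, launches the
// kernel, and returns one cluster index per point. Expects at least one point.
std::vector<int> assign_on_gpu(const std::vector<float>& points,
                               const std::vector<float>& clusters) {
    assert(clusters.size() == 2 * CLUSTER_NUM);
    int n = static_cast<int>(points.size() / 2);

    float* d_points = nullptr;
    float* d_clusters = nullptr;
    int* d_assign = nullptr;
    cudaMalloc(&d_points, points.size() * sizeof(float));
    cudaMalloc(&d_clusters, clusters.size() * sizeof(float));
    cudaMalloc(&d_assign, n * sizeof(int));
    cudaMemcpy(d_points, points.data(), points.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_clusters, clusters.data(), clusters.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // round up to cover all points
    assign_points<<<blocks, threads>>>(d_points, d_clusters, d_assign, n);

    std::vector<int> out(n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_assign, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_points);
    cudaFree(d_clusters);
    cudaFree(d_assign);
    return out;
}
```

Calling assign_on_gpu with a flat vector of 2 * N point coordinates and 2 * CLUSTER_NUM centroid coordinates returns N cluster indices. Error checking on the CUDA calls is left out to keep the sketch short.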

Thank you.