Hello everyone, I'm coding a CUDA version of 2D k-means and I'm getting a ~21x speedup over the sequential version.
My kernel that assigns points to clusters looks roughly like this:
__global__ void function_that_assign_point_to_cluster(float* punti, float* clusters) {
    // ... do work; one thread = one point ...
    // for each cluster, find the best fit:
    for (int i = 0; i < CLUSTER_NUM; i++) {
        // ... assign point ...
    }
    // ... do work ...
}
Usually CLUSTER_NUM is a small number such as 10, 20 or 50, but is this loop a problem for my performance? Is putting a relatively small loop inside each thread really considered a bad technique? Can it be done better?
The Visual Profiler reports that this function takes almost 100% of the GPU time, with a mean execution time of 22 ms.
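To make the pattern concrete, here is a minimal sketch of a one-thread-per-point assignment kernel with the loop over clusters. It is not the actual kernel from this post. The interleaved (x, y) layout of `punti` and `clusters`, the `assignment` output array, the `num_points` parameter, the value of `CLUSTER_NUM`, and the host wrapper `assign_points_on_gpu` are all assumptions made for illustration.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

#define CLUSTER_NUM 10  // assumed value; the post mentions 10, 20 or 50

// Assumed layout: punti and clusters hold interleaved (x, y) pairs.
// One thread handles one point and scans every centroid to find the nearest.
__global__ void assign_points_sketch(const float* punti, const float* clusters,
                                     int* assignment, int num_points) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_points) return;  // guard threads past the last point

    float px = punti[2 * idx];
    float py = punti[2 * idx + 1];

    float best_dist = FLT_MAX;
    int best = 0;
    for (int i = 0; i < CLUSTER_NUM; i++) {
        float dx = px - clusters[2 * i];
        float dy = py - clusters[2 * i + 1];
        float d = dx * dx + dy * dy;  // squared distance is enough for comparison
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    assignment[idx] = best;
}

// Hypothetical host wrapper: copies inputs to the device, launches the kernel,
// and copies the assignments back. Returns the first CUDA error encountered.
cudaError_t assign_points_on_gpu(const float* h_punti, const float* h_clusters,
                                 int* h_assignment, int num_points) {
    float* d_punti = nullptr;
    float* d_clusters = nullptr;
    int* d_assignment = nullptr;
    size_t punti_bytes = 2 * sizeof(float) * num_points;
    size_t clusters_bytes = 2 * sizeof(float) * CLUSTER_NUM;
    size_t assign_bytes = sizeof(int) * num_points;

    cudaError_t err = cudaMalloc(&d_punti, punti_bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_clusters, clusters_bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_assignment, assign_bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_punti, h_punti, punti_bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_clusters, h_clusters, clusters_bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;  // assumed block size, not tuned
        int blocks = (num_points + threads - 1) / threads;
        assign_points_sketch<<<blocks, threads>>>(d_punti, d_clusters,
                                                  d_assignment, num_points);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_assignment, d_assignment, assign_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_punti);
    cudaFree(d_clusters);
    cudaFree(d_assignment);
    return err;
}
```

In this sketch every thread runs the loop CLUSTER_NUM times, so the work per point grows linearly with the number of clusters, and all threads read the same centroid at the same iteration.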
Thank you.