Hello everyone, I’m facing a problem in doing an assignment in which i have to implement a 2D k-means algorithm in CUDA C.
I have a kernel that assigns points to clusters (each thread represent a point), I then have to increment the “size” of the clusters and the number of points contained in them, both points and clusters are matrixes linearized as array of doubles using row major scheme and they are both passed to the GPU using the global memory:
cudaMalloc(&punti_d, POINT_NUM * POINT_FEATURES * sizeof(double));
cudaMalloc(&cluster_d, CLUSTER_NUM * CLUSTER_FEATURES * sizeof(double));
cudaMemcpy(punti_d, punti, POINT_NUM * POINT_FEATURES * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(cluster_d, clusters, CLUSTER_NUM * CLUSTER_FEATURES * sizeof(double), aMemcpyHostToDevice);
Then i have the kernel that assigns points → clusters:
__global__ void assign_clusters(double* punti, double* clusters) {
...do work to find the best fitting cluster with index "best fit"...
atomicAdd(&clusters[best_fit * CLUSTER_FEATURES + 1], x_punto); //add x size
atomicAdd(&clusters[best_fit * CLUSTER_FEATURES + 2], y_punto); //add y size
atomicAdd(&clusters[best_fit * CLUSTER_FEATURES + 3],1); //add 1 point
}
The problem is that this implementation using AtomicAdds to increment values inside the cluster array is terribly slow, 4seconds compared to 0.1 seconds with 1.000.000 points and 20 clusters, how can i solve this?
I think that is quite necessary that i do this because otherwise it can present a race condition in the sum and some calculation can be inexact, is there a more efficient way to add atomically these variables?
I use this vesion of atomic add:
static __inline__ __device__ double atomicAdd(double* address, double val) {
unsigned long long int* address_as_ull = (unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
if (val == 0.0)
return __longlong_as_double(old);
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
I use this version because in visual studio 2019 with cuda 11 it does not find the “original” version of it.
I hope that someone can help me… thanks everyone.