Adding a variable inside a buffer from multiple threads

Hello everyone, I’m facing a problem in doing an assignment in which i have to implement a 2D k-means algorithm in CUDA C.
I have a kernel that assigns points to clusters (each thread represent a point), I then have to increment the “size” of the clusters and the number of points contained in them, both points and clusters are matrixes linearized as array of doubles using row major scheme and they are both passed to the GPU using the global memory:

cudaMalloc(&punti_d, POINT_NUM * POINT_FEATURES * sizeof(double));
cudaMalloc(&cluster_d, CLUSTER_NUM * CLUSTER_FEATURES * sizeof(double));
cudaMemcpy(punti_d, punti, POINT_NUM * POINT_FEATURES * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(cluster_d, clusters, CLUSTER_NUM * CLUSTER_FEATURES * sizeof(double), aMemcpyHostToDevice);

Then i have the kernel that assigns points → clusters:

__global__ void assign_clusters(double* punti, double* clusters) { work to find the best fitting cluster with index "best fit"...
            atomicAdd(&clusters[best_fit * CLUSTER_FEATURES + 1], x_punto); //add x size
	        atomicAdd(&clusters[best_fit * CLUSTER_FEATURES + 2], y_punto); //add y size
            atomicAdd(&clusters[best_fit * CLUSTER_FEATURES + 3],1);  //add 1  point

The problem is that this implementation using AtomicAdds to increment values inside the cluster array is terribly slow, 4seconds compared to 0.1 seconds with 1.000.000 points and 20 clusters, how can i solve this?
I think that is quite necessary that i do this because otherwise it can present a race condition in the sum and some calculation can be inexact, is there a more efficient way to add atomically these variables?
I use this vesion of atomic add:

static __inline__ __device__ double atomicAdd(double* address, double val) {
	unsigned long long int* address_as_ull = (unsigned long long int*)address;
	unsigned long long int old = *address_as_ull, assumed;
	if (val == 0.0)
		return __longlong_as_double(old);
	do {
		assumed = old;
		old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
	} while (assumed != old);
	return __longlong_as_double(old);

I use this version because in visual studio 2019 with cuda 11 it does not find the “original” version of it.

I hope that someone can help me… thanks everyone.

what GPU are you running on?

I’m running on a GTX 950 2GB windforce

can you do clustering with float instead of double?

are you building a debug project or are you building a release project?

I can try to change from double to float, i’m working in a debug project but i have tried to build the release one and i notice the same problem, even using Visual profiler it confirms that most of the execution time is occupied by “arithmetic operations”, if i remove the atomic adds (just to try) the time decreases a lot as i said originally.
Thank you btw for your fast reply.

In my experience, a debug project will run much more slowly than a release project. I personally would never try to do performance analysis on a debug project. Yes, you may still have performance issues, but I suspect your quoted 4 s number would drop substantially if you switch to a release project.

simply removing the atomic adds is likely not a valid test. The compiler is an optimizing compiler and may use dead code removal, if it finds code that has no impact on global state. commenting out code is a troublesome, potentially misleading method for performance analysis, for this reason.

If you only have 20 clusters, you may wish to switch to a shared atomics strategy rather than global atomics. See here but I would try to switch from double to float first. When you switch from double to float you can use the built-in atomicAdd rather than the atomicCAS loop method.

Ok i basically switched it to float and i try using the default atomicAdd, it is much better now. thank you.