Cluster of points similarity calculation

I’m learning CUDA C architecture for my course and this is my task.
Cluster is constructed by structure a simple structure.

struct Cluster {
	float* x;
	float* y;
	float* z;

Two clusters are then filled with N points and their similarity is calculated on the CPU with

float CPU_result(Cluster A, Cluster B, int n) {
	float difference = 0.0f;
	for (int i = 0; i < n-1; i++) {
		float tmp = 0.0f;
		for (int j = i+1; j < n; j++) {
			float diff_a = sqrt((A.x[i]-A.x[j])*(A.x[i]-A.x[j])
				+ (A.y[i]-A.y[j])*(A.y[i]-A.y[j])
				+ (A.z[i]-A.z[j])*(A.z[i]-A.z[j]));
			float diff_b = sqrt((B.x[i]-B.x[j])*(B.x[i]-B.x[j])
				+ (B.y[i]-B.y[j])*(B.y[i]-B.y[j])
				+ (B.z[i]-B.z[j])*(B.z[i]-B.z[j]));
			tmp += (diff_a-db) * (diff_a-diff_b);
		diff += tmp;
	return sqrt(1/((float)n*((float)n-1)) * difference);

My first (very poorly) written implemenation looks like this

__global__ void compute_gpu(ClusterA, ClusterB, int n ,float* result){
    //determine unique index within grid, miximizing block usage
    int i = threadIdx.y + blockIdx.y * blockDim.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n && j > i && i != j){
        float tmp = 0.0f;
        float diff_a = sqrt((A.x[i]-A.x[j])*(A.x[i]-A.x[j])
                    + (A.y[i]-A.y[j])*(A.y[i]-A.y[j])
                    + (A.z[i]-A.z[j])*(A.z[i]-A.z[j]));
        float diff_b = sqrt((B.x[i]-B.x[j])*(B.x[i]-B.x[j])
                    + (B.y[i]-B.y[j])*(B.y[i]-B.y[j])
                    + (B.z[i]-B.z[j])*(B.z[i]-B.z[j]));
        tmp = (diff_a-diff_b) * (diff_a-diff_b);
        atomicAdd(result, tmp);

        return ;

float GPU_result(Cluster A, Cluster B, int n) {
    float h_result[n];
    float *d_result;

    //allocate space on GPU for d_result variable
    cudaMalloc((void**)&d_result, sizeof(float));

    //define grid and block sizes
    dim3 block_size(16,16);
    dim3 grid_size((n+15)/16,  (n+15)/16);

    //Launch kernel with pass by referennce attribute and N size
    compute_gpu <<<grid_size, block_size>>>(A, B, n,d_result);

    //copy data from GPU memory to CPU memory via PCIe bus
    cudaMemcpy(h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    //use overall square root out of GPU kernel, to avoid race condition
    *h_result = sqrt(1/((float)n*((float)n-1)) * (* h_result));

    return *h_result;

My problem is, that GPU is computing rather presisely for N = 3000(with really bad performance), but as the number got higher, it surpasses accepted error of 1%.
I know this is not very bad performance-wise implementation and I would welcome any performance advices, but my primary goal now is get kernel computing precisely for much higher N (tens of thousands) values. Any advices what might be going wrong here?
Thanks in advance.

Without knowing an actual/specific test case, this may simply be due to the limits of float precision. You may disagree; in any event I wouldn’t be able to spend more time on it without an actual, complete test case.

From a performance perspective, you have a AoS data arrangement, and you may get better performance switching to SoA. This goes into more detail.

How is that error being measured? If you are using your CPU implementation (or a different single-precision implementation) as the reference, that is not a valid methodology. If you are comparing against a double-precision implementation as a “golden” reference, that would likely result in a valid assessment. Accumulated round-off error in low-precision float arithmetic may be a problem at high dimensions, but that should affect your CPU implementation as well.

As far as I could see looking at the code for ten seconds, it computes √(x2+y2+z2). For that you would want to use normd3df() for maximum accuracy. For example:

float diff_a = norm3df (A.x[i]-A.x[j], A.y[i]-A.y[j], A.z[i]-A.z[j]);
1 Like

Thanks for your answer.

|(result_cpu - result gpu) / (result_cpu + result_gpu)| < 1%

Anyway I have implemented this by array sum reduction and it is working just fine. Now I’m trying to cuda shared memory between different kernel calls, since this kernel is called multiple times, and previous calls affects overall result.

I’ve managed to solve by completely rebuilding my implementation which made much more sense and is faster.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.