Floating atomicMin() help!!!

Hey all, I’m trying to write some CUDA code here. Bear in mind, I’m still a noob with this kind of stuff.

What I’m trying to do is this:

We have a tetrahedron t and a point set, and we’re trying to find the point that is closest to the center of the circumsphere of the tetrahedron (yay, computational geometry). I can do all the distance calculations fine, but I’m trying to use atomicMin to pick out the point closest to the center. We know which point we want because it’ll have the most negative value (in the sense that -3 counts as more negative than -1).

So, here’s my code/attempt:

__global__ void point_distance(point *p_buffer, tetra *t, int num_points, float *dist_array, int *global_min, float *real_min) {

    int i = blockIdx.x*blockDim.x + threadIdx.x;

    if (i < num_points) {

        point ba = p_buffer[t->p[1]] - p_buffer[t->p[0]];
        point ca = p_buffer[t->p[2]] - p_buffer[t->p[0]];
        point da = p_buffer[t->p[3]] - p_buffer[t->p[0]];
        point pa = p_buffer[i + 4]   - p_buffer[t->p[0]];

        float o[4][4] = { { ba.x, ba.y, ba.z, ba.q },
                          { ca.x, ca.y, ca.z, ca.q },
                          { da.x, da.y, da.z, da.q },
                          { pa.x, pa.y, pa.z, pa.q } };
        dist_array[i] = o[0][0] * (o[1][1]*(o[2][2]*o[3][3] - o[3][2]*o[2][3]) - 
                                   o[1][2]*(o[2][1]*o[3][3] - o[3][1]*o[2][3]) + 
                                   o[1][3]*(o[2][1]*o[3][2] - o[3][1]*o[2][2]))

                      - o[0][1] * (o[1][0]*(o[2][2]*o[3][3] - o[3][2]*o[2][3]) - 
                                   o[1][2]*(o[2][0]*o[3][3] - o[3][0]*o[2][3]) + 
                                   o[1][3]*(o[2][0]*o[3][2] - o[3][0]*o[2][2]))

                      + o[0][2] * (o[1][0]*(o[2][1]*o[3][3] - o[3][1]*o[2][3]) - 
                                   o[1][1]*(o[2][0]*o[3][3] - o[3][0]*o[2][3]) + 
                                   o[1][3]*(o[2][0]*o[3][1] - o[3][0]*o[2][1]))

                      - o[0][3] * (o[1][0]*(o[2][1]*o[3][2] - o[3][1]*o[2][2]) - 
                                   o[1][1]*(o[2][0]*o[3][2] - o[3][0]*o[2][2]) + 
                                   o[1][2]*(o[2][0]*o[3][1] - o[3][0]*o[2][1]));

        atomicMin(global_min, __float_as_int(dist_array[i]));
    }

    *real_min = __int_as_float(*global_min);

    return;
}

And so far, my real_min is -5832.

I’m calling the kernel like this:

    int *min_host = 0;
    int *min_device = 0;

    cudaMallocHost((void**) &min_host, sizeof(int));
    cudaMalloc((void**) &min_device, sizeof(int));

    float *float_host = 0;
    float *float_device = 0;

    cudaMallocHost((void**) &float_host, sizeof(float));
    cudaMalloc((void**) &float_device, sizeof(float));

    point_distance <<<nBlocks, blockSize>>> (p_buffer_device, t_buffer_device, num_points, circumsphere_distance_device, min_device, float_device);

    cudaMemcpy(circumsphere_distance_host, circumsphere_distance_device, num_points*sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(float_host, float_device, sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < num_points; ++i) {
        std::cout << circumsphere_distance_host[i] << ", " << p_buffer_host[i + 4] << std::endl;
    }

    std::cout << "\nglobal min : " << *float_host << std::endl;

    cudaFree(min_device);
    cudaFreeHost(min_host);

So, how do I do this? Also, here’s sample output of every point and its distance (the fiducial tetrahedron has vertices (0, 0, 0), (9, 0, 0), (0, 9, 0) and (0, 0, 9), in that order):

0, 0 0 0
-5832, 0 0 1
-10206, 0 0 2
-13122, 0 0 3
-5832, 0 1 0
-11664, 0 1 1
-16038, 0 1 2
-18954, 0 1 3
-10206, 0 2 0
-16038, 0 2 1
-20412, 0 2 2
-23328, 0 2 3
-13122, 0 3 0
-18954, 0 3 1
-23328, 0 3 2
-26244, 0 3 3
-5832, 1 0 0
-11664, 1 0 1
-16038, 1 0 2
-18954, 1 0 3
-11664, 1 1 0
-17496, 1 1 1
-21870, 1 1 2
-24786, 1 1 3
-16038, 1 2 0
-21870, 1 2 1
-26244, 1 2 2
-29160, 1 2 3
-18954, 1 3 0
-24786, 1 3 1
-29160, 1 3 2
-32076, 1 3 3
-10206, 2 0 0
-16038, 2 0 1
-20412, 2 0 2
-23328, 2 0 3
-16038, 2 1 0
-21870, 2 1 1
-26244, 2 1 2
-29160, 2 1 3
-20412, 2 2 0
-26244, 2 2 1
-30618, 2 2 2
-33534, 2 2 3
-23328, 2 3 0
-29160, 2 3 1
-33534, 2 3 2
-36450, 2 3 3
-13122, 3 0 0
-18954, 3 0 1
-23328, 3 0 2
-26244, 3 0 3
-18954, 3 1 0
-24786, 3 1 1
-29160, 3 1 2
-32076, 3 1 3
-23328, 3 2 0
-29160, 3 2 1
-33534, 3 2 2
-36450, 3 2 3
-26244, 3 3 0
-32076, 3 3 1
-36450, 3 3 2
-39366, 3 3 3

I don’t think that __float_as_int() is what you want to use. It reinterprets the bits of the IEEE-754 floating-point representation as a 32-bit integer, which is not going to sort in the order you want in general. You will either need to cast the float to an integer (which will lose precision if you have fractional values) or use a trick like the one shown here:

http://stereopsis.com/radix.html
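A minimal sketch of that trick on the device, assuming IEEE-754 floats and no NaNs in your data (the flip/unflip names are just for illustration): flip() maps floats to unsigned ints that compare in the same order as the floats, and unflip() reverses the mapping.

__device__ unsigned int flip(float f) {
    // Reinterpret the float bits as an unsigned int, then remap so that
    // unsigned ordering matches float ordering: negative floats get every
    // bit inverted, non-negative floats just get the sign bit set.
    unsigned int u = (unsigned int)__float_as_int(f);
    return u ^ ((u >> 31) ? 0xFFFFFFFFu : 0x80000000u);
}

__device__ float unflip(unsigned int u) {
    // Undo flip(): values with the top bit set came from non-negative floats,
    // everything else came from negative floats.
    u ^= (u >> 31) ? 0x80000000u : 0xFFFFFFFFu;
    return __int_as_float((int)u);
}

With global_min declared as an unsigned int* and initialised to 0xFFFFFFFF before the launch, the kernel would then call atomicMin(global_min, flip(dist_array[i])), and you recover the float minimum afterwards with unflip(*global_min).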

Couldn’t I also just use thrust::min_element? Or would that have to be done host side?

And even if the operation had to be done host-side, isn’t it possible that would still be more efficient than an atomic operation?

I don’t have any experience with thrust, so I can’t speak to how it is implemented.

Assuming you are using a Kepler GPU (or later), I think doing the atomic min on the GPU as you do the calculation is likely to be faster than copying the data back to the host to find the minimum.
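Just to make concrete what the copy-back alternative involves, a rough sketch using the variable names from your launch code (find_min_on_host is just an illustrative wrapper): it moves all num_points floats over PCIe every call.

#include <algorithm>
#include <cuda_runtime.h>

// Illustrative only: copy the per-point distances back and scan them on the CPU.
float find_min_on_host(const float *circumsphere_distance_device,
                       float *circumsphere_distance_host, int num_points) {
    cudaMemcpy(circumsphere_distance_host, circumsphere_distance_device,
               num_points * sizeof(float), cudaMemcpyDeviceToHost);
    return *std::min_element(circumsphere_distance_host,
                             circumsphere_distance_host + num_points);
}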

Fair enough.

As a C++ coder, thrust is very attractive.

But all I have is a GTX 460, which I think is the Fermi architecture. I’m not really savvy with nVidia products, so I have no idea if Kepler is better or not. Having a nicer GPU would be very nice though XD

Yes, the GTX 460 is the Fermi architecture. (Kepler started with the GTX 600 series.) The reason I bring up Kepler is that in that architecture the performance of atomic operations on global memory improved by a factor of 7x or more over Fermi, making them a practical alternative to parallel reductions for many use cases.

Since you have a Fermi GPU, I would investigate whether Thrust can do a global memory reduction on the device. (I suspect it can.) That’s probably the fastest and easiest option for your situation.
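Untested sketch, since as I said I haven’t used Thrust myself, but something along these lines should wrap the device array you already have (names taken from your launch code):

#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

float thrust_min(float *circumsphere_distance_device, int num_points) {
    // Wrap the raw device pointer so Thrust knows the data lives on the GPU,
    // run the min-reduction on the device, then pull the single winner back.
    thrust::device_ptr<float> dist(circumsphere_distance_device);
    thrust::device_ptr<float> smallest = thrust::min_element(dist, dist + num_points);
    return *smallest;   // dereferencing copies one float back to the host
}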

Oh hey, I just figured I should pop back in and say that I finally got atomicMin() to work. Such a pain, but it seems to be working now. God, CUDA is a beast lol. Thank you for the link, btw.