Finding minimum "distance"


I have a question about a problem I have with CUDA. Let’s say I have a number of blocks and a number of threads. Each thread calculates a “distance” between points on the X-Y plane. What I want is to return the ID of the thread that holds the minimum distance among the distances calculated by all threads. Could you please help me implement that in a generally fast way? I use __syncthreads() so that every thread is done with its calculation, but after that I have no idea how to compare the distances between the threads…


Take a look at the “CUDA Parallel Reduction” SDK program. It sounds like this is pretty close to what you want to do.
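For illustration, here is a minimal sketch of how a reduction of that kind could be adapted to return the index of the minimum rather than a sum. It is not the SDK code. The names blockArgMin and nearestPointIndex, the block size of 128, and the choice of squared Euclidean distance to a reference point are all assumptions made for the example. Each block reduces its distances in shared memory and writes one (minimum, index) pair; the host then picks the best pair among the blocks.

```cuda
#include <cstdio>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128  // threads per block; must be a power of two for this reduction

// Each block reduces its threads' distances to a single (min, index) pair.
__global__ void blockArgMin(const float2* pts, float2 ref, int n,
                            float* blockMin, int* blockArg)
{
    __shared__ float sDist[THREADS];
    __shared__ int   sIdx[THREADS];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Assumed "distance": squared Euclidean distance to a reference point.
    // Threads past the end of the data contribute FLT_MAX so they never win.
    float d = FLT_MAX;
    if (gid < n) {
        float dx = pts[gid].x - ref.x;
        float dy = pts[gid].y - ref.y;
        d = dx * dx + dy * dy;
    }
    sDist[tid] = d;
    sIdx[tid]  = gid;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step,
    // keeping the smaller distance together with the index it came from.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sDist[tid + s] < sDist[tid]) {
            sDist[tid] = sDist[tid + s];
            sIdx[tid]  = sIdx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        blockMin[blockIdx.x] = sDist[0];
        blockArg[blockIdx.x] = sIdx[0];
    }
}

// Hypothetical host wrapper: returns the index of the point nearest to ref.
// Assumes n > 0; CUDA error checking is omitted to keep the sketch short.
int nearestPointIndex(const float2* hPts, int n, float2 ref)
{
    const int blocks = (n + THREADS - 1) / THREADS;
    float2* dPts; float* dMin; int* dArg;
    cudaMalloc(&dPts, n * sizeof(float2));
    cudaMalloc(&dMin, blocks * sizeof(float));
    cudaMalloc(&dArg, blocks * sizeof(int));
    cudaMemcpy(dPts, hPts, n * sizeof(float2), cudaMemcpyHostToDevice);

    blockArgMin<<<blocks, THREADS>>>(dPts, ref, n, dMin, dArg);

    std::vector<float> hMin(blocks);
    std::vector<int>   hArg(blocks);
    cudaMemcpy(hMin.data(), dMin, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hArg.data(), dArg, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dPts); cudaFree(dMin); cudaFree(dArg);

    // Final pass over the per-block winners on the host.
    int best = 0;
    for (int b = 1; b < blocks; ++b)
        if (hMin[b] < hMin[best]) best = b;
    return hArg[best];
}

int main()
{
    const int n = 1000;
    std::vector<float2> pts(n);
    for (int i = 0; i < n; ++i)
        pts[i] = make_float2((float)i + 1.0f, (float)(n - i));
    pts[637] = make_float2(0.5f, 0.5f);  // planted as the nearest point to the origin

    int idx = nearestPointIndex(pts.data(), n, make_float2(0.0f, 0.0f));
    printf("nearest index: %d\n", idx);
    return 0;
}
```

Doing the last step on the host keeps the kernel simple. With many blocks, a second kernel launch over the per-block results would avoid the extra copy, at the cost of more code.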

I love the SDK N body simulation

Thank you alrikai, that was very helpful. I have one more question: does CUDA support the unsigned long long type? I’m trying to declare a __shared__ unsigned long long array of 128 elements and unsigned long long tid = threadIdx.x + blockIdx.x * blockDim.x. The code compiles correctly, but I get wrong results once the values exceed the range of int, that is, more than 2,147,483,647.

Any ideas?

CUDA does support unsigned long long types; the following is shamelessly plagiarized from the CUDA Programming Guide 4.0, section B.11:

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

As for why unsigned long long values aren’t working for you, that’s a bit more difficult to infer. Is your device’s compute capability >= 1.3?

The 64-bit types long long and unsigned long long are supported by CUDA independent of compute capability. To avoid intermediate overflow in the expression whose result is assigned to the (unsigned) long long, make sure to cast one of the operands on the right-hand side to (unsigned) long long, so the entire expression is evaluated in 64-bit arithmetic.
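A small sketch of the difference, in case it helps. The helper names index32 and index64 and the stand-in block index of 5,000,000 are made up for the example; in a real kernel the arguments would be blockIdx.x, blockDim.x and threadIdx.x. Those built-ins are unsigned int, so the uncast expression wraps at 2^32 before it is widened. If a signed int is involved anywhere in the expression, the trouble can start at 2^31 instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// 32-bit version: the multiply and add are done in unsigned int and wrap
// around; only the already-wrapped result is widened on return.
__host__ __device__ unsigned long long index32(unsigned int block,
                                               unsigned int dim,
                                               unsigned int thread)
{
    return thread + block * dim;
}

// 64-bit version: casting one operand first makes the whole expression
// evaluate in unsigned long long, so nothing is lost.
__host__ __device__ unsigned long long index64(unsigned int block,
                                               unsigned int dim,
                                               unsigned int thread)
{
    return (unsigned long long)block * dim + thread;
}

// A hypothetical block index is passed in as a parameter so that a
// one-thread launch can show the effect without a multi-billion-thread grid.
__global__ void indexDemo(unsigned int fakeBlock, unsigned long long* out)
{
    out[0] = index32(fakeBlock, 1024u, threadIdx.x);
    out[1] = index64(fakeBlock, 1024u, threadIdx.x);
}

int main()
{
    unsigned long long hOut[2];
    unsigned long long* dOut;
    cudaMalloc(&dOut, sizeof(hOut));

    indexDemo<<<1, 1>>>(5000000u, dOut);

    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    printf("32-bit expression: %llu\n", hOut[0]);
    printf("64-bit expression: %llu\n", hOut[1]);
    return 0;
}
```

Applied to the index from the question, the fix would be along the lines of unsigned long long tid = (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x.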