find point closest to cursor

I have a bunch of points and I need to find the one closest to the cursor, but if each point has its own thread, how will it know whether it's the closest without each one comparing itself to all the others? That seems rather suboptimal.
Would it be best to return the distance between each point and the cursor to the CPU and find the closest there, or is there an efficient way of doing this in CUDA?

__device__ float minbright = 0.5F;
__global__ void Compute(float2* p0, float* a, int** n, int count, double time, float2 mouseloc){
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;
    if(a[i]>minbright){
        if((a[i] -= time) < minbright) a[i] = minbright; // parentheses needed: -= binds looser than <
    }
    const auto dx = mouseloc.x - p0[i].x;
    const auto dy = mouseloc.y - p0[i].y;
    const auto ds = dx * dx + dy * dy;
    //which thread has the smallest ds value
}

Finding the minimum (or maximum) of a bunch of computed values (e.g. distances from a particular point) is an example of a type of problem called a reduction.

It can be implemented in parallel efficiently using canonical methods:

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
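For example, a block-level min reduction in the style of that pdf could look roughly like this (a sketch, not your exact code: it assumes the squared distances are already in a device array, the block size is a power of two, and the kernel is launched with blockDim.x * sizeof(float) bytes of dynamic shared memory; each block writes one partial minimum, which you then reduce in a second launch or on the host):

#include <cfloat>

__global__ void BlockMin(const float* ds, float* blockMins, int count){
    extern __shared__ float sdata[];
    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < count) ? ds[i] : FLT_MAX;  // pad out-of-range threads with a huge value
    __syncthreads();
    for(int s = blockDim.x / 2; s > 0; s >>= 1){ // tree reduction in shared memory
        if(tid < s) sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if(tid == 0) blockMins[blockIdx.x] = sdata[0]; // one partial minimum per block
}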

You can also use libraries, e.g. thrust, to do it:

https://stackoverflow.com/questions/10124528/parallel-reduction-on-cuda-with-array-in-device
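For instance, if the squared distances are already in a device array, something along these lines would do it (a sketch; d_ds and count are assumed names):

#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

thrust::device_ptr<float> ds(d_ds);
thrust::device_ptr<float> best = thrust::min_element(ds, ds + count); // runs on the device
int   bestIdx = best - ds; // index of the closest point
float bestDs  = *best;     // its squared distance (copies one float back to the host)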

It can also be done trivially using atomics:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicmin

atomicMin was my first thought, but it can't be used with floats.
I can't use thrust because I'm using ManagedCuda from C#.
That pdf is confusing as hell, but I'm working on it.

More-or-less arbitrary atomics can be “constructed” using the example given in the programming guide.

https://stackoverflow.com/questions/55140908/can-anybody-help-me-with-atomicmin-function-syntax-for-cuda/55145948#55145948

I think specifically for float, you can do a cast and the ordering is preserved (for non-negative floats, the IEEE-754 bit patterns compare in the same order as the values themselves):

https://stackoverflow.com/questions/17399119/cant-we-use-atomic-operations-for-floating-point-variables-in-cuda

Does this look right?

__device__ static float atomicMin(float* address, float val){
    const auto address_as_i = reinterpret_cast<int*>(address);
    int old = *address_as_i, assumed;
    do{
        assumed = old;
        old = atomicCAS(address_as_i, assumed, __float_as_int(fminf(val, __int_as_float(assumed))));
    } while(assumed != old);
    return __int_as_float(old);
}
__device__ float minbright = 0.5F;
__device__ auto minimum = 1.0F;
__global__ void Compute(float2* p0, float* a, int* n, int count, double time, float2 mouseloc){
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;
    if (a[i] > minbright){ if ((a[i] -= time) < minbright) a[i] = minbright; }
    if(i == 0)minimum = 1.0F;
    const auto dx = mouseloc.x - p0[i].x;
    const auto dy = mouseloc.y - p0[i].y;
    const auto ds = dx * dx + dy * dy;
    atomicMin(&minimum, ds);
    __syncthreads();
    if(minimum != ds) return;
    a[i] = 1.0F;
}

I’m not sure this will do what you want:

__syncthreads();
    if(minimum != ds) return;
    a[i] = 1.0F;

__syncthreads() is a block-wide barrier, not a grid-wide barrier.

I’m not sure this is right either:

if(i == 0)minimum = 1.0F;

there is no defined thread execution order in CUDA. Not sure why you need that line anyway.

I know __syncthreads is only block-wide, but I can't find a grid-wide sync.

I just need minimum to be the same variable for all threads and I need it to start at 1.0.

If you know what I should be doing please tell me.

In many cases, I can’t really tell you what you should be doing because I don’t know your intent. I can study your code to try and guess at intent, but that is error-prone.

This:

__device__ auto minimum = 1.0F;

will accomplish that, without needing this:

if(i == 0)minimum = 1.0F;

(so I would delete that line of code; it can have unintended bad effects depending on block execution order)

The one place where this will fall apart is if you call the kernel repeatedly. The initialization of minimum will only apply to the first kernel launch. Thereafter, before each subsequent launch of Compute, I would do this in host code:

float my_init_val = 1.0f;
cudaMemcpyToSymbol(minimum, &my_init_val, sizeof(float));

sorry, I can't tell you how to do that in C#/managedCuda, but I'm sure such trivial concepts are documented on the managedCuda site.

Regarding the syncthreads usage, a global sync barrier is possible in CUDA using either a new kernel launch (boundary) or else via CUDA cooperative groups. I'm not sure cooperative groups are available in managedCuda (maybe). However, before going down either of those avenues, I would question what you are trying to accomplish. Are you trying to identify the point/thread that produced the actual global minimum distance?

If so I would go back to reduction coding. It’s possible to write a min-reduction that captures both the value and the index/thread that produced that value:

https://stackoverflow.com/questions/38176136/finding-minimum-value-in-array-and-its-index-using-cuda-shfl-down-function

(and there are other examples on stack overflow that don’t use warp shuffle)

If you want to do it with atomics, my best suggestion would be to write your own custom atomic:

https://stackoverflow.com/questions/17411493/how-can-i-implement-a-custom-atomic-function-involving-several-variables/17414007#17414007
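For instance, one way to build such a custom atomic is to pack the squared distance and the thread index into a single 64-bit word and use the built-in 64-bit atomicMin. This is only a sketch, assuming the distances are non-negative; packedMin is a hypothetical global that you would reset to all ones before each launch:

__device__ unsigned long long packedMin = 0xFFFFFFFFFFFFFFFFULL;

__device__ void atomicMinDistIdx(float ds, unsigned int idx){
    // for non-negative floats, the bit pattern compared as an unsigned integer
    // orders the same way as the float value, so the distance in the high 32 bits
    // dominates the comparison and the index just rides along in the low 32 bits
    const unsigned long long key =
        (static_cast<unsigned long long>(__float_as_uint(ds)) << 32) | idx;
    atomicMin(&packedMin, key);
}

// afterwards, e.g. in a second kernel:
//   float        minDs  = __uint_as_float(static_cast<unsigned int>(packedMin >> 32));
//   unsigned int winner = static_cast<unsigned int>(packedMin & 0xFFFFFFFFu);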

I don't think managedCuda can access global variables, but removing the if seems to give a better result.

I can't use the cooperative_groups grid sync because that requires TCC mode; it doesn't work in WDDM mode.

I just need to find which p0[i] is closest to mouseloc and set the corresponding a[i] to 1.0F.

Can I just atomicAdd a global variable and wait till it reaches gridDim.x*blockDim.x to sync the grid?
I tried

atomicAdd(&synccount, 1);
const auto threadcount = gridDim.x*blockDim.x;
while(synccount<threadcount){}

but the kernel doesn’t return and the device resets.

mark the synccount variable as volatile

then when you use it in the atomicAdd, you will have to cast it back to a non-volatile type.
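Roughly like this (a sketch reusing the synccount name from your snippet; see the caveat about resident blocks below):

__device__ volatile int synccount = 0;

// inside the kernel:
atomicAdd((int*)&synccount, 1);                  // cast away volatile for the atomic
const auto threadcount = gridDim.x * blockDim.x;
while(synccount < threadcount){}                 // the volatile read forces a reload each iteration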

and this method only has a chance of working if all the blocks are physically resident on the GPU SMs. If you are launching more blocks than that, it will deadlock.

Even then there is no guarantee in CUDA that this is correct programming. Which is one reason why cooperative groups were introduced.
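For reference, the cooperative groups version of a grid-wide barrier looks roughly like this (a sketch; it needs a cooperative launch via cudaLaunchCooperativeKernel and a block count that fits on the device all at once):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// hypothetical kernel shape; "minimum" would be the global from your code
__global__ void ComputeCoop(float* minimum /*, ...the rest of your parameters */){
    cg::grid_group grid = cg::this_grid();
    // ...compute ds for this thread and atomicMin(minimum, ds) as before...
    grid.sync(); // grid-wide barrier; only valid for cooperative launches
    // ...every thread can now read *minimum and compare it to its own ds...
}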

Well, that gives me the result I'm looking for, and the most threads I can use without my Titan V hanging is 131072, with a block size of 128 giving 1024 blocks.
It will do for now, but I will work on something more (can't think of the word).

Edit0: Actually, make that 102400 threads with 800 blocks of 128.
Edit1: Now 1200 blocks of 128 seems to be working, but sometimes it crashes.

The reduceMin6 kernel from the Stack Overflow post you linked above uses __shfl_down, which is deprecated, so I have to use __shfl_down_sync instead, but what do I set the mask to?
Also, how do I use the reduceMin6 kernel?

I had the idea of using __syncthreads() to narrow down the number of waiting threads to 1 per block, but it seems that the number of blocks waiting is the problem, not the total number of threads.

if (mouseloc.x != 0.0F && mouseloc.y != 0.0F){
    const auto dx = mouseloc.x - p[i].x;
    const auto dy = mouseloc.y - p[i].y;
    const auto ds = dx * dx + dy * dy;
    if(threadIdx.x == 0) minimums[blockIdx.x] = 1.0F;
    atomicMin(&minimums[blockIdx.x], ds);
    __syncthreads();
    if(minimums[blockIdx.x] == ds){
        atomicMin(const_cast<float*>(&m), minimums[blockIdx.x]);
        atomicAdd(const_cast<int*>(&synccount), 1);
        while(synccount < gridDim.x){}
        if(m == ds && a[i] == minbright){ a[i] = 1.0F; }
    }
}

I give up, this simple task is impossible!

I did it! I did the impossible!

__global__ void FindNearest(float2* p, float2 mouseloc, volatile float* m, volatile int* mi){
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float dist;
    if(threadIdx.x == 0) dist = 1.0F;
    const auto d = abs(mouseloc.x - p[i].x) + abs(mouseloc.y - p[i].y);
    atomicMin(&dist, d);
    __syncthreads();
    if(d == dist){
        atomicMin(const_cast<float*>(m), dist);
        if(*m == dist) *mi = i;
    }
}

Edit: Just realized it didn’t need the while loop!
Another edit: using abs seems faster than dx * dx + dy * dy

I noticed the above is not perfect. I think there is a minor race between the last two lines (the atomicMin on m and the if(*m == dist) check that follows); is it possible to hold a lock on m across those two lines?

Yes, abs is going to be faster, but it is incorrect for finding the distance. The dx * dx + dy * dy option is slower, but it has the added bonus of being correct. You don't need to take the square root yet, because distance comparisons can and should be done on the squared magnitude.

I wouldn’t call the race in your code “minor”. You need to make sure you initialize first, then do atomic operations after that.

Is 1 the largest distance you’ll ever possibly encounter?
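Something like this for the per-block part, with the ordering fixed (a sketch reusing your names plus a hypothetical blockMin output array, and relying on the custom float atomicMin from earlier; FLT_MAX from <cfloat> avoids having to assume 1.0 is an upper bound):

#include <cfloat>

__global__ void BlockNearest(const float2* p, float2 mouseloc, float* blockMin, int count){
    __shared__ float dist;
    if(threadIdx.x == 0) dist = FLT_MAX; // initialize first...
    __syncthreads();                     // ...and make sure every thread sees it before any atomics
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    float ds = FLT_MAX;
    if(i < count){
        const auto dx = mouseloc.x - p[i].x;
        const auto dy = mouseloc.y - p[i].y;
        ds = dx * dx + dy * dy;          // squared distance; no sqrt needed for comparisons
        atomicMin(&dist, ds);            // the custom float atomicMin from earlier
    }
    __syncthreads();                     // all atomics done before anyone reads dist
    if(i < count && ds == dist) blockMin[blockIdx.x] = ds; // or record the winning index here
}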

This works just fine

__global__ void FindNearestInternal(float* px, float* py, volatile float* d, float2 mouseloc, volatile float* mn){
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ volatile float dist;
    dist = 2.0F;
    d[i] = fabsf(mouseloc.x - px[i]) + fabsf(mouseloc.y - py[i]);
    __syncthreads();
    atomicMin(const_cast<float*>(&dist), d[i]); // cast away volatile for the custom atomicMin
    __syncthreads();
    if(dist == d[i]){
        atomicMin(const_cast<float*>(mn), dist); // cast away volatile, as above
    }
}
__global__ void SetNearest(volatile float* a, volatile float* d, volatile float* mn){
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;
    if(d[i] == *mn) a[i] = 1.0F;
}
__global__ void FindNearest(float* px, float* py, volatile float* a, volatile float* d, float2 mouseloc, volatile float* mn, unsigned long griddim, unsigned long blockdim){
    *mn = 2.0F;
    FindNearestInternal<<<griddim, blockdim>>>(px, py, d, mouseloc, mn);
    cudaDeviceSynchronize();
    SetNearest<<<griddim, blockdim>>>(a, d, mn);
}