In many cases, I can’t really tell you what you should be doing because I don’t know your intent. I can study your code to try to guess at intent, but that is error-prone.
__device__ auto minimum = 1.0F;
will accomplish that, without needing this:
if(i == 0)minimum = 1.0F;
(so I would delete that line of code; it can have unintended bad effects depending on block execution order)
The one place where this will fall apart is if you call the kernel repeatedly. The initialization of minimum applies only to the first kernel launch. Thereafter, before each subsequent launch of Compute, I would do this in host code:
float my_init_val = 1.0f;
cudaMemcpyToSymbol(minimum, &my_init_val, sizeof(float));
Sorry, I can’t tell you how to do that in C#/managedCuda, but such basic operations should be documented on the managedCuda site.
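Putting the pieces together in CUDA C++, the pattern might look like the following sketch. Compute, its launch configuration, and num_passes are placeholders for your own code:

```
__device__ float minimum = 1.0f;   // initialized once, at module load

__global__ void Compute(const float *data, int n) { /* ... */ }

// host code
for (int pass = 0; pass < num_passes; ++pass) {
    float my_init_val = 1.0f;
    // reset the device-side minimum before every launch after the first
    cudaMemcpyToSymbol(minimum, &my_init_val, sizeof(float));
    Compute<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();
}
```

The cudaMemcpyToSymbol call is what the first launch got for free from static initialization; subsequent launches need it done explicitly.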
Regarding the syncthreads usage, a grid-wide sync barrier is possible in CUDA either at a kernel launch boundary or via CUDA cooperative groups. I’m not sure cooperative groups are available in managedCuda (maybe). However, before going down either of those avenues, I would question what you are trying to accomplish. Are you trying to identify the point/thread that produced the actual global minimum distance?
If so I would go back to reduction coding. It’s possible to write a min-reduction that captures both the value and the index/thread that produced that value:
(and there are other examples on stack overflow that don’t use warp shuffle)
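Here is a sketch of what such a min-reduction can look like using warp shuffle. It is one possible formulation, not the only one; it assumes blockDim.x is a multiple of 32 and at most 1024, and it produces one (value, index) pair per block that you would then reduce in a second step:

```
#include <cfloat>

// reduce (val, idx) pairs within a single warp, keeping the minimum value
// and the index that produced it
__inline__ __device__ void warpReduceMin(float &val, int &idx) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other_val = __shfl_down_sync(0xFFFFFFFF, val, offset);
        int   other_idx = __shfl_down_sync(0xFFFFFFFF, idx, offset);
        if (other_val < val) { val = other_val; idx = other_idx; }
    }
}

__global__ void minWithIndex(const float *data, int n,
                             float *block_min, int *block_idx) {
    __shared__ float svals[32];
    __shared__ int   sidxs[32];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? data[i] : FLT_MAX;
    int   idx = (i < n) ? i : -1;

    warpReduceMin(val, idx);                 // stage 1: per-warp reduction

    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    if (lane == 0) { svals[warp] = val; sidxs[warp] = idx; }
    __syncthreads();

    if (warp == 0) {                         // stage 2: first warp reduces warp results
        int nwarps = blockDim.x >> 5;
        val = (lane < nwarps) ? svals[lane] : FLT_MAX;
        idx = (lane < nwarps) ? sidxs[lane] : -1;
        warpReduceMin(val, idx);
        if (lane == 0) { block_min[blockIdx.x] = val; block_idx[blockIdx.x] = idx; }
    }
}
```

The per-block results can be combined either with a second kernel launch or with an atomic step at the end of the first kernel.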
If you want to do it with atomics, my best suggestion would be to write your own custom atomic:
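One common way to build such a custom atomic is to pack the float’s bit pattern and the index into a single 64-bit word and use the built-in 64-bit atomicMin (this requires compute capability 3.5 or higher). It works here because, for non-negative floats, the IEEE-754 bit pattern orders the same way as the values themselves; negative values would need extra handling. A sketch:

```
// pack: value bits in the high 32 bits, index in the low 32 bits, so that
// the integer comparison in atomicMin compares values first and breaks
// ties on index. Assumes all values are non-negative.
__device__ unsigned long long packed_min = 0xFFFFFFFFFFFFFFFFULL;

__device__ void atomicMinValIdx(unsigned long long *addr, float val, int idx) {
    unsigned long long packed =
        ((unsigned long long)__float_as_uint(val) << 32) | (unsigned int)idx;
    atomicMin(addr, packed);   // 64-bit atomicMin, sm_35+
}

__global__ void Compute(const float *dist, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMinValIdx(&packed_min, dist[i], i);
}

// Host side, after the kernel completes:
//   unsigned long long h;
//   cudaMemcpyFromSymbol(&h, packed_min, sizeof(h));
//   unsigned int bits = (unsigned int)(h >> 32);
//   float min_val; memcpy(&min_val, &bits, sizeof(float));
//   int   min_idx = (int)(h & 0xFFFFFFFFu);
```

As with the plain float version, packed_min would need to be reset via cudaMemcpyToSymbol before each launch after the first.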