How to deal with multiple threads writing to the same GPU memory location?

I’m performing some operations on particles/objects in 3D space, my kernel looks something like this:

__global__ void mykernel(float *x, float *y, float *z,
                         float *data_in, float *data_out)
{
    // perform per-particle work here

    for (...) {                        // loop over this particle's contributions
        data_out[f] = something[g];    // 'f' depends on the particle's position
    }
}

The location of the index ‘f’ in data_out depends on the particle’s position. Currently, one thread performs the work for one particle. If two or more particles are close together in 3D, odds are their ‘f’ values will be the same, and the threads associated with those particles will try to write to the same location at the same time. If I run the kernel in its current form it doesn’t crash, but I often get different results for the same input data from run to run.

I can sort of get around this with a second kernel: the first kernel writes each particle’s result into its own slot in a large staging array, and the second kernel adds all the data up sequentially. But is there a better way?
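
For concreteness, the workaround looks roughly like this (compute_contribution, cell_of_particle, and the staging array are placeholder names, not my real code):

// stand-in for the real per-particle work
__device__ float compute_contribution(float x, float y, float z)
{
    return x + y + z;
}

// pass 1: every particle writes to its own private slot, so no
// two threads ever touch the same address
__global__ void scatter(const float *x, const float *y, const float *z,
                        float *staging, int num_particles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_particles)
        staging[i] = compute_contribution(x[i], y[i], z[i]);
}

// pass 2: one thread accumulates everything sequentially --
// deterministic, but completely serial
__global__ void reduce(const float *staging, const int *cell_of_particle,
                       float *data_out, int num_particles)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        for (int i = 0; i < num_particles; ++i)
            data_out[cell_of_particle[i]] += staging[i];
}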

Not really. Global atomics would “work”, but they’re for occasional use only, not something to base the bulk of your memory accesses on. They’re horribly inefficient.
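
(For reference, the atomic version just replaces the plain store with a read-modify-write, assuming your writes are accumulations and your hardware supports atomicAdd on float. Note that the order of the additions still varies between runs, so float totals can differ in the last bits.)

for (...) {
    // was: data_out[f] = something[g];
    atomicAdd(&data_out[f], something[g]);   // correct, but serializes on every collision
}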

Maybe you can completely reinvent your algorithm so that it’s not a problem?

But honestly, outputting separate results and then reducing them sounds like it’d be fairly efficient. (You perform a few dozen operations before issuing the write, correct? If it’s somewhere between a few dozen and a few thousand, you should be fine. And instead of maintaining two separate kernels, just have one kernel flip between the two modes.) Fairly efficient is a lot better than horribly inefficient (or non-deterministic), and that’s a good trade overall.
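
One way to read the “two modes” idea, sketched with made-up names (cell, staging, n); the kernel is launched once per phase, with the launch boundary acting as the global sync between them:

__global__ void mykernel(const float *x, const float *y, const float *z,
                         float *staging, float *data_out,
                         const int *cell, int n, int mode)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (mode == 0) {
        // phase 0: each thread writes to its own private slot
        if (i < n)
            staging[i] = x[i] + y[i] + z[i];   // stand-in for the real work
    } else {
        // phase 1: one thread folds the staging array into the output
        if (i == 0)
            for (int j = 0; j < n; ++j)
                data_out[cell[j]] += staging[j];
    }
}

// usage: mykernel<<<grid, block>>>(..., 0); then mykernel<<<grid, block>>>(..., 1);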

I might be able to rearrange it so that each thread operates on one box at a time, but first I would need to go through each particle and determine which box it’s in, something like the sketch below.
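
(BOXES_PER_AXIS and box_size are made-up names here, and this assumes positions are non-negative and lie inside the grid.)

#define BOXES_PER_AXIS 32   // hypothetical grid resolution

// map a position on a uniform grid to a flat box index
__device__ int box_index(float px, float py, float pz, float box_size)
{
    int bx = (int)(px / box_size);
    int by = (int)(py / box_size);
    int bz = (int)(pz / box_size);
    return (bz * BOXES_PER_AXIS + by) * BOXES_PER_AXIS + bx;
}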
Different boxes will then contain different numbers of particles, which raises another question: if M threads start at the same time and N of them finish early (e.g., they had 1 particle instead of 2), will the scheduler start up N other threads to take their place?

Right. I thought about such schemes, but figured it’d probably be less efficient than just outputting all results and combining them afterward. (Though with the right algorithm it might work well.) On CUDA you want a regular, symmetric algorithm.

Blocks and warps can finish early, but individual threads can’t. If only one thread in a warp finishes, it just idles until its warp comrades finish too. And if a whole warp finishes early, the other resident warps keep the GPU busy, but no new warps start until the whole block finishes. Only then is a new block loaded.