Right now, I am trying to parallelize a routine that deposits particles with continuous coordinates onto a discrete grid within our simulation box (roughly 8 grid points per particle). The current solution launches one thread per particle, indexed by particle id. Since the particle list is unsorted, there is no guarantee that any two particles are near each other, so threads (whether in the same block or not) may write to the same grid points, which creates the potential for data races.
Therefore, we use atomicAdd for every deposit, which is really slowing down the code (adding the particles to the grid now takes about as long as our cuFFT routines).
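For reference, a simplified sketch of the current kernel (names and the nearest-cell indexing are illustrative; the real code deposits interpolated weights onto ~8 neighboring points):

```cuda
// Scatter approach: one thread per particle, atomic deposit into the grid.
__global__ void depositParticles(const float3 *pos, float *grid,
                                 int nParticles, int3 g, float cellSize)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;

    // Locate the particle's cell (nearest-grid-point for brevity).
    int ix = (int)(pos[p].x / cellSize);
    int iy = (int)(pos[p].y / cellSize);
    int iz = (int)(pos[p].z / cellSize);
    int idx = (iz * g.y + iy) * g.x + ix;

    // Particles are unsorted, so any two threads may target the same
    // cell; every deposit therefore has to be atomic.
    atomicAdd(&grid[idx], 1.0f);
}
```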
One idea is to flip the routine entirely: launch one thread per grid point and have each thread loop over all the particles, accumulating their contributions into its own grid point. This has the advantage of ensuring there is no data race, since each grid point is written by exactly one thread. The downside is that we have many more particles than grid points, so we would launch far fewer threads, and each thread would do far more work.
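The gather variant I have in mind would look roughly like this (same illustrative indexing as above; note the O(nGridPoints × nParticles) work):

```cuda
// Gather approach: one thread per grid point, looping over all particles.
// Race-free: each thread writes only its own cell, no atomics needed.
__global__ void gatherParticles(const float3 *pos, float *grid,
                                int nParticles, int3 g, float cellSize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= g.x * g.y * g.z) return;

    // Recover this thread's 3D cell coordinates.
    int ix = idx % g.x;
    int iy = (idx / g.x) % g.y;
    int iz = idx / (g.x * g.y);

    float acc = 0.0f;
    for (int p = 0; p < nParticles; ++p) {
        // Accumulate only particles that fall in this cell.
        if ((int)(pos[p].x / cellSize) == ix &&
            (int)(pos[p].y / cellSize) == iy &&
            (int)(pos[p].z / cellSize) == iz)
            acc += 1.0f;
    }
    grid[idx] = acc;  // single, non-atomic write
}
```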
Since I’d like to compare launching over grid points vs. over particles, I’d also like to improve the particle-based kernel. Is there a way, using __shared__ memory or mutexes, to write the particle coordinate data in a safe way?
For example, can the threads within a block accumulate into a __shared__ array spanning the entire grid (using something to prevent data races), and then have each block add its shared copy into a single global array?
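What I’m picturing is something like this privatization sketch (GRID_SIZE is a made-up compile-time constant; this only works if the whole grid, or a tile of it, fits in the ~48 KB of shared memory available per block by default):

```cuda
#define GRID_SIZE 4096  // illustrative; GRID_SIZE * sizeof(float) must fit in shared memory

// Each block accumulates into a private __shared__ copy of the grid using
// cheap shared-memory atomics, then merges that copy into global memory.
__global__ void depositPrivatized(const float3 *pos, float *grid,
                                  int nParticles, int3 g, float cellSize)
{
    __shared__ float localGrid[GRID_SIZE];

    // Cooperatively zero the block-local copy.
    for (int i = threadIdx.x; i < GRID_SIZE; i += blockDim.x)
        localGrid[i] = 0.0f;
    __syncthreads();

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < nParticles) {
        int ix = (int)(pos[p].x / cellSize);
        int iy = (int)(pos[p].y / cellSize);
        int iz = (int)(pos[p].z / cellSize);
        int idx = (iz * g.y + iy) * g.x + ix;
        atomicAdd(&localGrid[idx], 1.0f);  // shared-memory atomic: much cheaper
    }
    __syncthreads();

    // Merge: at most one global atomicAdd per grid point per block,
    // instead of one per particle.
    for (int i = threadIdx.x; i < GRID_SIZE; i += blockDim.x)
        if (localGrid[i] != 0.0f)
            atomicAdd(&grid[i], localGrid[i]);
}
```

If the grid is too large for shared memory, I assume the same idea could be applied per tile, with each block responsible for a sub-region.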
Would it make sense for each particle to have its own copy of the entire array and then just sum the copies at the end? (How would one do that? Two kernel calls?)
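Concretely, the two-kernel version of that idea would look something like the sketch below (names are made up; the copies buffer must be zeroed, e.g. with cudaMemset, before kernel 1, and a true per-particle copy costs nParticles × nGridPoints floats, which is why I’m unsure it’s feasible):

```cuda
// Kernel 1: each particle deposits into its OWN private grid copy.
// No atomics needed, since no two threads share a destination array.
__global__ void depositToCopies(const float3 *pos, float *copies,
                                int nParticles, int3 g, float cellSize)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;

    size_t nGrid = (size_t)g.x * g.y * g.z;
    int ix = (int)(pos[p].x / cellSize);
    int iy = (int)(pos[p].y / cellSize);
    int iz = (int)(pos[p].z / cellSize);
    int idx = (iz * g.y + iy) * g.x + ix;

    copies[(size_t)p * nGrid + idx] += 1.0f;  // private copy: plain add
}

// Kernel 2: reduce the copies into the final grid. Race-free because
// each thread owns exactly one output cell.
__global__ void sumCopies(const float *copies, float *grid,
                          int nCopies, size_t nGrid)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= nGrid) return;

    float acc = 0.0f;
    for (int c = 0; c < nCopies; ++c)
        acc += copies[(size_t)c * nGrid + i];
    grid[i] = acc;
}
```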
Thanks in advance