It seems I’m in over my head, but maybe I’m missing some basics.
Let me describe what the code is supposed to do.
To calculate a kind of Huygens-like reflection off a geometric body, one places point scatterers on its surface and sums up their contributions to the return signal.
Depending on the wavelength, the number of scatterers is quite large; I’d like to support an arbitrary number, but would be satisfied with 1e12.
With that many scatterers there is no way to hold their information (locations etc.) in (CPU) memory, so it is calculated on the fly in a while-continue loop. Since the geometry is built incrementally, I don’t even know beforehand exactly how many scatterers I’ll end up with.
A CPU implementation of this works just fine (if slowly). Compiling the same code for the GPU works too, but on medium-sized and larger geometries the kernel takes too long to complete and gets terminated (presumably by the watchdog timer).
So for now I’ve parallelized by splitting the whole set of scatterers into subsets of arbitrary size, with the idea of having one thread per subgeometry.
The problem is: each scatterer’s reflection contributes to one cell of the return signal, depending on its distance to the transmitter. There is no way to determine beforehand which scatterers will contribute to a given cell. Even if I sorted the scatterers by distance, the gathered information would be too large to fit in (even CPU) memory.
Thus scatterers from different threads will contribute to the same index of the return signal, and the simple line returnsig[index] = returnsig[index] + contribution; fails because the modification of returnsig by one thread is not seen by another.
This is demonstrated in emulation mode: just counting the scatterers, by changing the line above to returnsig[index] = returnsig[index] + 1;, shows that emulated threads catch every scatterer while real GPU threads miss more than 99% of them (depending on subset size).
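In kernel form, the per-thread work is roughly the following (a sketch only; every identifier here is a placeholder, not my actual code):

```cuda
// Sketch: one thread per subgeometry; scatterers generated on the fly.
__global__ void accumulate(float *returnsig, int ncells, int nsub)
{
    int sub = blockIdx.x * blockDim.x + threadIdx.x;
    if (sub >= nsub) return;
    SubState st = initSub(sub);       // hypothetical incremental geometry state
    while (hasNext(&st)) {
        Scatterer s = next(&st);      // computed on the fly, never stored
        int index = rangeCell(s);     // cell from distance to transmitter
        // Unsynchronized read-modify-write: threads whose scatterers fall
        // into the same cell silently overwrite each other's updates.
        returnsig[index] = returnsig[index] + contribution(s);
        // The counting test replaces the line above with
        // returnsig[index] = returnsig[index] + 1.0f;
    }
}
```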
Giving each thread its own return signal and adding them up afterwards is not an option. Some approximate numbers from my recent test example: length of the return signal 72000, total scatterers 1e9, subgeometries/threads 2000. Even these modest sizes give me a CUDA out-of-memory error when allocating on the device.
I have way too many scatterers to even allocate a single int targetindex[NUMBEROFSCATTERERS].
I have too many scatterers to run them all in one single thread, because that thread would take too long to complete.
If I distribute the scatterers over several threads (and “several” cannot be limited to something like the number of processors), more than one thread will modify the same index of the return signal.
I’m out of ideas here. I hope I’ve missed something basic that allows several threads to consistently modify the same variable.
Or maybe you have a bright idea for organizing the problem in a different way…
Thanks in advance