Approach to handle random occurence/frequency/size of global memory operations per thread

Imagine you have a number of threads each responsible for the simulation of a distinct particle( 3D with position(x,y,z) and direction(l,m,n) with some other info).

There are regions in this simulation world where these particles can generate secondary particles based on some random process. Some 3D regions are more probable to generate secondaries than others, and once it has been determined that there has been an additional particle generated, I need to save a bunch of data related to that generation at that point in the simulated process.

My previous version allowed only for the generation of a single secondary particle, so each thread either generated 0 or 1 secondary particles. At the end of simulation I would save data for both the primary and secondary particles, with a ‘dummy’ value for the secondary if one was not generated.

Now I need to implement a version where there can be (theoretically) up to 60,000 secondary particle events generated from a primary during the course of each particles simulation. A primary particle could be in a ‘volatile’ region where each movement in that region has a high probability of generating a secondary event.

Once a secondary particle event has been generated, I do not need to simulate that secondary further, rather just save a bunch of information about the state of things at that exact point in the simulation.

So the bottom line is that I have no realistic upper limit on the number of secondaries, and cannot use pre-allocated registers or shared memory for local storage and ordering before writing out the global memory.

The other GPU versions of this type of simulation I have seen use a single 32 bit integer global value which is updated via atomic operations. This atomic operation would return an integer value which is the next available index in global memory for the secondary particle event information to be stored. A thread in that implementation which generated a secondary particle event would have to diverge from the rest of threads in a block, update an global counter,save about 120 bytes of information to global memory, then continue the simulation.

Such implementations are slow and I am trying to create a more efficient implementation.

Overall the probability that a large number of secondary events will be generated is small, but I was told not to rely on that fact and assume each thread(primary particle) could generate a secondary particle each step in the simulation.

Since each thread’s primary particle in a thread block could be in very different regions of the 3D world, it is unlikely that multiple threads in a block/warp will simultaneously need to store information related to their generated secondary particles at the same point in the simulation process.

A toolkit like Geant4 (CPU based) simulates particles serially one by one, and if a secondary particle is generated it saves that info and later simulates that secondary particle in a subsequent run.

My current CUDA application operates much differently so that approach is not possible. Already I am close to my resource limit per thread block (shared memory and registers), so more than two potential secondary particles will be a challenge with the current structure.

Ultimately I can use global counters and disparate global memory updates as well, but would like to test other potentially more efficient methods first.

Anyone have any ideas or links to papers which address this problem in a parallel architecture?


One approach I am trying is to save the required data upon the first secondary generation event, as well as the location and other related data of the primary particle at that point. Then I would set an “incomplete” flag and suspend that thread’s primary simulation at that point. The data related to the single secondary would be saved with other threads own result info upon completion of the thread block work.

Then (assuming all primary particles are not done) the remaining incomplete primary particles will be subject to another launch, starting off at the last state of the secondary generation. Then that primary particle could again generate a secondary particle, save that data, etc. This would be repeated until all primary particles were finished. In the meantime I will gradually be accumulating the info needed from the secondary events.

Once all is done then there is a large final step which considers all info from all primary and secondary particles, and accumulates into a final result.

That should prevent disparate random memory accesses by doing all memory operations in lock-step coalesced fashion enforced by __syncthreads(). The downside is that I need more bookkeeping and I am going to have to operate from a do{…}while() loop from the host invoking multiple launches until all primary particles have finished and all the secondary information has been saved.

Still interested in other ideas, as I am determined to avoid a global counter accessed via Atomic operations.

…and i thought i have difficult problems to solve

i am surprised njuffa has not come to your aid already, as he does math on infinity (without even having to concentrate much)

“One approach I am trying is to save the required data upon the first secondary generation event, as well as the location and other related data of the primary particle at that point. Then I would set an “incomplete” flag and suspend that thread’s primary simulation at that point. The data related to the single secondary would be saved with other threads own result info upon completion of the thread block work.”

if i interpret this correctly, would this not potentially lead to a few threads - one or two - ‘arresting’ the block, as they fail to ‘obtain’ secondary particles?

nevertheless, the approach would likely favour small(er) blocks, even single warp blocks; not so?

if you could negate the different arrival rates of threads at secondary particles - the varying probability of particles to hit secondary particles
i would even consider using half warps, such that i know each thread has some storage space, such that it can push at least 1 hit into temporary storage
threads then only need to write out data when they hit a subsequent secondary particle, and their storage is full
you should already see a better probability distribution, as it now becomes the probability that other threads of the block would have a secondary particle to write, when (at least) one thread’s ‘cache’ is full, and it must write at least 1 entry to free space to continue

the other hypothetical that comes to mind is that, if the cost of simulation is cheap relative to the cost of writing out the data, one might consider ‘scanning’ the simulation world first, and build some form of secondary particle distribution plot
the 1st run merely counts secondary particles, perhaps even per region; the 2nd run then writes the secondary particles
this would already significantly reduce array size requirement uncertainty

Your observations are correct,and I do use small threads blocks of 64 threads in this application.

In general with Monte Carlo simulations thread divergence is an issue(due to the random processes which determine the position, direction energy etc), but my current implementation (without the secondary florescence events) does manage to reduce divergence down to a tolerable level.

I just want to modify the current implementation so that these secondary events can be cached locally and then written out to global in a order coalesced fashion. The mentioned approach is my first attempt, so after some rigorous testing I may determine if that is the solution to this issue.

I am not sure I agree that this is a bad way of going about things. This approach is robust (can deal with any number of secondary particles), and as long as the probability of secondary particles is very low (as you seem to imply) it should be high performance. So in sum, it would handle the common case quickly, and the uncommon cases correctly, with gradual performance degradation. The fact that other resources (registers, shared memory) are maxed out already also would indicate that this is a good solution, as there is really no other way for the data to go but global memory.

With similar issues I have found it extremely helpful to get my hands on a least one real-life data file considered representative to inform further investigation of relevant tradeoffs. Currently it is impossible to tell whether the most pronounced bottleneck of this simple scheme would be (1) thread divergence, (2) atomics, (3) memory bandwidth consumption.

For example, if the major bottleneck turned out to be atomics, one could consider operating multiple “tapes” in parallel to record the secondary particles to reduce contention among threads. Pushing that to extremes, one could record the secondary particles in a (thread-)local array, i.e. use one “tape” per thread.

While a primary particle is in a ‘volatile’ region then many secondary events can be triggered in succession, so much depends on the number of ‘volatile’ regions and the probabilities associated with those regions.

I will test both implementations to verify correctness and to determine a performance comparison.

I would suggest trying to get your hands on a real-life data set because the amount of locality in the generation of secondary particles and its performance impact is not well understood. Simplistic synthetic data often does not capture the behavior realistically, usually because the statistical distribution does not matched the actual stochastic process well enough. Generating good synthetic data can be surprisingly hard.

One thing I have learned the hard way is not to dismiss a “dumb” solution out of hand, before actually having tried it.