Thread Synchronization

My senior design team and I are trying to CUDA-ize some PIC code for plasma simulations. The plan I had at first was to just copy all our arrays into global memory and have each thread work on a specific subsection based on its id; you know, your basic grid partitioning stuff. However, we will need barriers and mutexes/semaphores for this, and all the podcasts on NVIDIA's website kept saying that threads weren't designed to synchronize globally, i.e., across multiple thread blocks. Is there any way around this without coming up with some really complex scatters and data shuffling back and forth to each block's cache region? An obvious solution, I guess, is to use only one thread block, but that would kill the speed-up. And yes, if you haven't picked up on it, I am a complete CUDA n00b, so it's possible that I'm just misunderstanding stuff.

There is one easy-to-use global barrier in CUDA, and that’s the point between the completion of one kernel and the start of the next. You are guaranteed that all writes to global memory made in one kernel are completed and visible to the next kernel that runs. The overhead of a kernel launch is pretty short (15 microseconds is a number I’ve heard quoted), so this is not a huge burden unless you need very frequent barrier events.
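To make that concrete, here is a minimal host-side sketch (the kernel and array names are just placeholders): two kernels launched back to back on the same stream run in order, and the second one sees everything the first one wrote to global memory.

    // stepA and stepB are hypothetical kernels; d_data is a device array.
    __global__ void stepA(float *data, int n);   // writes to data
    __global__ void stepB(float *data, int n);   // reads what stepA wrote

    void runOneIteration(float *d_data, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;

        stepA<<<blocks, threads>>>(d_data, n);
        // No explicit synchronization is needed here for correctness:
        // launches on the same stream execute in order, and all global
        // memory writes from stepA are visible to stepB when it starts.
        stepB<<<blocks, threads>>>(d_data, n);
    }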

So, if you can possibly structure your code as a sequence of kernel calls, then you will have the most pain-free approach to global synchronization in CUDA. If that doesn’t work, there are other options to avoid global synchronization, and as a last resort, there are implementations of mutexes floating around in the forum (use Google to search rather than the forum search engine, which is terrible, and read the thread carefully because there are also a lot of incorrect implementations that have been posted). However, the performance of such mutexes is positively horrible for highly contested locks.
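For reference, the basic shape of those forum mutexes is a spin lock on an integer in global memory. This is only a hedged sketch of the idea, not a recommendation, and the comments note the usual pitfall:

    // Global spin lock built on atomicCAS/atomicExch; lock must start at 0.
    __device__ void acquireLock(int *lock)
    {
        // Spin until we flip the lock from 0 (free) to 1 (held).
        while (atomicCAS(lock, 0, 1) != 0)
            ;  // busy-wait
    }

    __device__ void releaseLock(int *lock)
    {
        atomicExch(lock, 0);
    }

    // Pitfall: if several threads in the same warp compete for the lock, the
    // warp can hang, because the thread holding the lock cannot advance to the
    // release while its warp-mates are stuck spinning in the divergent branch.
    // In practice people restrict this to one thread per block, and even then
    // a heavily contended lock serializes everything behind it.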

I should mention that one trick to avoid write conflicts is to reconsider the partitioning in your code. For example, a lot of people try to assign one thread to each input element, but then find that threads might need to write to the same location for the output. If it is possible to flip the algorithm around and give each thread ownership of an output location, then the problem goes away. The penalty is that now you might have to read input elements multiple times.

A pathological case of this is histogramming. You could assign a thread to each output bin, and let each thread scan the entire input stream looking for elements that fall into its bin, incrementing a private counter, and then writing the count to its output cell at the end. This is a completely lock-free algorithm which unfortunately requires reading the input stream once for each histogram bin. This is not a good way to do histogramming, even in CUDA, but I bring it up to demonstrate the transformation between input and output partitioning.
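Just to make that partitioning concrete, a hedged sketch of the per-bin version might look like this (the bin math and names are made up):

    // One thread per output bin: each thread scans the whole input, counts the
    // elements that land in its bin, and writes its count once. Lock-free, but
    // the input is read numBins times in total.
    __global__ void histogramPerBin(const float *input, int n,
                                    int *bins, int numBins,
                                    float minVal, float binWidth)
    {
        int bin = blockIdx.x * blockDim.x + threadIdx.x;
        if (bin >= numBins) return;

        int count = 0;
        for (int i = 0; i < n; ++i) {
            int b = (int)((input[i] - minVal) / binWidth);
            if (b == bin) ++count;        // private counter, no atomics needed
        }
        bins[bin] = count;                // each thread owns its output cell
    }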

Can you say more about what you want to do? Someone might know of an existing CUDA algorithm which looks like it.

Basically, in the sequential version what you have is a number of grid points on your grid. The algorithm is iterative; each iteration represents a passing of time. Particles (macroparticles of ions and electrons) are injected at the top and the bottom of the grid. Based on the current position of each particle, the charge density, then the potential and E-field, are calculated for each grid point. Based on this E-field, the particles are then moved accordingly, and the algorithm goes back to the injection phase. Particles can fly out of the top and bottom of the grid, but loop around the sides to represent an infinitely wide 2-D cross section.

Our goal is to parallelize this in CUDA, and then, using OpenGL, we will create a nice little visual of all our particles moving around. We may also end up auto-generating some graphs. I can probably decompose this into kernels like you said and barrier things that way, but I'm going to have to come up with a really creative way of achieving mutual exclusion. It seems like kind of a large oversight that this functionality is not present in CUDA; after all, it did say that CUDA is able to perform atomic operations on global memory. Mutexes and semaphores are cornerstones of most shared-memory code that I've seen.

I just got a paper from the professor who is sponsoring our team; it covers some PIC work in CUDA that is similar to our project, so maybe it will discuss these sorts of issues.

Ok, this problem does sound conceptually very similar to things people do in 3D molecular dynamics simulations (though much simpler in your case). For example, I know HOOMD (now HOOMD-Blue) has been mentioned several times in the forum:

http://codeblue.umich.edu/hoomd-blue/

I doubt you can extract any tricks from a package of that size easily, but perhaps MisterAnderson42 (the lead developer) will drop by this thread and comment. :)

Also, to help calibrate my thinking, how many particles and how many grid points roughly are you talking about here? Depending on the relative numbers, I can imagine the following algorithm (this could be nonsense, but hopefully gives you some ideas):

Kernel 1:

  • Assign each thread to a grid point

  • Each thread loops over all particles, computing their contribution to the E-field at its point

  • Each thread writes out the net E-field to its grid location in global memory

Kernel 2:

  • Assign each thread to a particle

  • Each thread looks up the appropriate grid point to determine the net E-field at its particle location

  • Each thread writes the particle's position for time step N+1 back to global memory.

Kernel 3 or CPU:

  • Deal with particles that leave the grid and/or introduce new ones. This sort of compaction and appending is probably easiest done on the CPU, but if this turns out to be a bottleneck, you might find yourself needing to do a sort on the GPU.

Go back to Kernel 1 for next time step.
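Strung together, the host-side time-step loop might look roughly like this. All of the names here (computeField, pushParticles, Particle, injectAndCompactOnCPU, and so on) are hypothetical placeholders for the structure above, with the injection/compaction step done on the CPU as suggested:

    // Each kernel launch boundary acts as the global barrier between the
    // field computation, the particle push, and the injection step.
    for (int step = 0; step < numSteps; ++step) {
        // Kernel 1: one thread per grid point accumulates the E-field.
        computeField<<<gridBlocks, threadsPerBlock>>>(d_particles, numParticles,
                                                      d_efield, numGridPoints);

        // Kernel 2: one thread per particle gathers the field and moves it.
        pushParticles<<<particleBlocks, threadsPerBlock>>>(d_particles, numParticles,
                                                           d_efield, numGridPoints, dt);

        // Step 3 on the CPU: copy particles back, drop the ones that left the
        // grid, inject new ones at the top and bottom, then copy back up.
        cudaMemcpy(h_particles, d_particles, numParticles * sizeof(Particle),
                   cudaMemcpyDeviceToHost);
        numParticles = injectAndCompactOnCPU(h_particles, numParticles);
        cudaMemcpy(d_particles, h_particles, numParticles * sizeof(Particle),
                   cudaMemcpyHostToDevice);
    }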

So the subtleties I glossed over have to do with optimizing memory accesses (which are far more expensive than any calculations you will be performing). For example, in Kernel 1, a block of threads could cooperate to load several particles and charges into shared memory so that all threads on that multiprocessor could read them. That easily cuts down your global memory transfers by a factor of 100. In Kernel 2, you have the problem that threads will be making random memory reads, which will get much lower bandwidth than large contiguous reads. I don't know of any way to speed that up if your grid size is large. (Though, again, hopefully smart people will chime in here and propose something.)
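A rough sketch of that shared-memory trick for Kernel 1, assuming a made-up Particle struct and a block size equal to the tile size, might look like this; the actual field contribution is left as a comment because it depends on your deposition scheme:

    struct Particle { float2 pos; float charge; };   // hypothetical layout

    #define TILE 128   // assumes the kernel is launched with TILE threads per block

    __global__ void computeField(const Particle *particles, int numParticles,
                                 float2 *efield, int numGridPoints)
    {
        __shared__ Particle tile[TILE];

        int gp = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's grid point
        float2 e = make_float2(0.0f, 0.0f);

        for (int base = 0; base < numParticles; base += TILE) {
            // Cooperative load: each thread fetches one particle of the tile.
            int i = base + threadIdx.x;
            if (i < numParticles)
                tile[threadIdx.x] = particles[i];
            __syncthreads();

            // Every thread in the block reads the whole tile from shared memory.
            int count = min(TILE, numParticles - base);
            for (int j = 0; j < count; ++j) {
                // e += contribution of tile[j] at grid point gp
                // (depends on your charge-deposition / field formula)
            }
            __syncthreads();   // all threads reach this, even excess grid points
        }

        if (gp < numGridPoints)
            efield[gp] = e;
    }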

Assuming the above actually does what you want, no mutual exclusion was required, at the expense of having to read the particle positions more than once. Now, depending on the scale of things, this might turn out to be great, or be horrible compared to the CPU. (Mostly depending on how effective the CPU cache is for your problem size.)

Anyway, I hope that gives you some ideas, if nothing else.

It is one thing for CUDA to support atomic operations (though probably not of the sort you want), but it is another thing for them to be fast. The problem here is that CUDA devices are designed to handle thousands or tens of thousands of simultaneously running threads. (Seriously, even though there are only 240 stream processors on a chip. You get best performance by oversubscribing them by a factor of 10 or more so that memory latency is hidden.) Mutual exclusion simply doesn’t scale that well unless the probability of two threads hitting the same lock is small.
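To make the contention point concrete, here is a small hypothetical example: a single atomic counter is perfectly correct code, but every update to the same address has to be serialized by the hardware, so with tens of thousands of threads it turns into a long serial chain even though no explicit lock is involved.

    // Every thread that passes the test increments the same global counter.
    __global__ void countHits(const float *values, int n, float threshold,
                              unsigned int *counter)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && values[i] > threshold)
            atomicAdd(counter, 1u);   // correct, but contended
    }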