programs using BIG, COMPLEX threads Request for references to relevant work: cellular automata / A.I


Most applications described on all these “nearby” web sites, forums, classes, tutorials, etc. involve relatively small thread-programs. Things rather closer to array-processing than to the kind of use I am exploring.

Imagine using GPU / CUDA / Tesla C 1060 for “Life”

Then imagine scaling the complexity of the rules and “organisms” 'way 'way up. And understand that the kinds of speed increases made available by CUDA for graphics is (possibly) irrelevant. (The code running for each automaton need not be more than a page or two, but each automaton would then be calling functions and accessing truly global memory all the time. Don’t be horrified! This the opposite end of the spectrum from array-processor / graphics use of parallel processing.)

Who out there is working on things like this? Can anybody refer me to literature or community members? and please forgive my ignorant newbie-ish-ness. My school’s CS dep’t just ordered our very first CUDA devices.

Please respond to my email as well as to the forum if it’s at all convenient.


Dr. Chris Lanz
SUNY / Potsdam

Sounds a bit like Conway’s game of life, but with a far more elaborate rule set and state machine. You may run into two issues:

-branch divergence (which is bad for performance), if your rule set is complex and contains many conditionals

-bandwidth limitation (due to hammering global memory). Try to coalesce your memory access whereever you can.

If you work on a truly large grid, then compared to a CPU you will probably see speed-ups in the order of the ratio of the available bandwidth (e.g. 150 GB/sec vs 20 GB/sec for CPU), so roughly a speed-up of factor 8. To gain more speed-up on the GPU, you might have to transfer entire blocks of cells into shared memory and operate on them locally.


Depending on your automaton’s interaction range, your problem becomes more or less appropriate for GPUs.

For example, if the interaction range is very short (only adjacent or very close particles/cells interact with each other) you have a situation like Smoothed Particle Hydrodynamics (SPH), which is used in the SDK example “particles.” If each interacting automaton is influenced by all faraway elements, but with an influence that decays with distance, you have something like the “nbody” example. Both of these achieve very good speedups vs. CPU versions.

My own research is a variant of the nbody-type code, but each individual interaction is somewhat more complicated (~70 FLOPs and two conditionals). The effect of that, though, is a higher arithmetic intensity than nbody, which leads to even better GFLOP/s as a fraction of theoretical, and >100x speedups compared to CPU.

If your automatons must access large amounts of global memory, though, or, as the other poster noted, require many conditionals, then your performance improvement will be severely limited. I’m curious to see what your application is, specifically.

We just got our Teslas, and I’m not exactly sure which of my interests will be expressed in CUDA. We’ll be starting with small automatons in swarms that interact little with each other , a lot with the background environment (16k is probably enough for the environment), and run a limited number of fairly small functions.

I’m just finishing a LaTex-edited paper. It’ll be on-line as a pdf soon. In the meantime if you’re serious goto’s easy to read on MY browser, but apparently it’s a big problem elsewhere.

I am waiting for parallel processing management at the level of, say, 256 8080 processors. Tesla is still just a step above array-processing, from my point of view.

FYI, the particles sample doesn’t actually use SPH, just simple collisions.