Need advice on using CUDA for a certain task

I’m not very experienced at programming, but I’m capable of writing some simple code for simple tasks. Recently I came across a task: simulating the interaction of many small particles with one common substance. I’ve read a bit about CUDA before, and I thought it might work here. I’ll describe the task as briefly as I can.

We have:
~1000 small particles
1 common substance

These small particles can be in 4 different states, and they can change their state with some probability at every time step. All small particles start in state 1 and can interact with the common substance only when in state 4. After interacting they go to state 3.
Here’s a scheme of the possible transitions: [diagram not reproduced]

The simulation starts at step 1 and runs for e.g. 1000 steps. At each step, for each small particle, the program has to decide whether it makes a transition to another state or interacts with the common substance (generate a random number and check whether it falls in a certain interval).
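
The "generate a random number and check which interval it hits" step could look like the following C++ sketch for one decision; `cum` holds cumulative probabilities of the possible outcomes, and all names and values here are illustrative placeholders, not anything from the actual simulation:

```cpp
#include <random>
#include <vector>

// Returns the index of the cumulative-probability interval that a uniform
// draw in [0,1) lands in. For example, cum = {0.05, 0.15, 1.0} encodes three
// outcomes with probabilities 5%, 10%, and 85%.
int pickOutcome(const std::vector<double>& cum, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double u = uni(rng);
    for (size_t k = 0; k < cum.size(); ++k)
        if (u < cum[k]) return (int)k;
    return (int)cum.size() - 1; // guard against rounding at the top end
}
```

Each state would have its own `cum` table listing its allowed transitions plus a final "stay put" interval.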

I read that one multiprocessor can execute only one operation across all of its threads at a time. So the question is: can I program several multiprocessors, each for a specific operation? For example, one for the 1->2 transition, another for 2->1, a third for 2->3, etc. Is it possible to do something like this on a GPU? Or maybe there’s some other way to do it. I’m willing to hear any advice. Thanks!

If the numbers you have given (~1000 particles and ~1000 time steps) are correct, I don’t think you actually need to dive into CUDA for this one, unless you have extreme timing requirements. A basic approach in the language you are comfortable with should be able to run this simulation in a very short time.
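
To make that concrete, here is a minimal CPU version of the simulation as described in the question. The transition probabilities are invented placeholders (the thread never gives the real ones); only the state machine's shape follows the post:

```cpp
#include <array>
#include <random>
#include <vector>

// Runs nParticles through nSteps of the 4-state Monte Carlo process and
// returns the final count of particles in each state (indices 1..4).
// All probabilities below are made-up example values.
std::array<int, 5> runSim(int nParticles, int nSteps, unsigned seed) {
    std::vector<int> state(nParticles, 1);   // all particles start in state 1
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    for (int t = 0; t < nSteps; ++t) {
        for (int i = 0; i < nParticles; ++i) {
            double u = uni(rng);
            switch (state[i]) {
                case 1: if (u < 0.10) state[i] = 2; break;   // 1 -> 2
                case 2: if (u < 0.05) state[i] = 1;          // 2 -> 1
                        else if (u < 0.15) state[i] = 3;     // 2 -> 3
                        break;
                case 3: if (u < 0.10) state[i] = 4; break;   // 3 -> 4
                case 4: if (u < 0.10) state[i] = 3; break;   // interact, then 4 -> 3
            }
        }
    }
    std::array<int, 5> counts{};
    for (int s : state) ++counts[s];
    return counts;
}
```

For ~1000 particles and ~1000 steps this is only about a million state updates, which finishes in milliseconds on any modern CPU.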

The example was simplified just to keep the description of the main aspects short ) But time is not the only problem.
Interactions change the state of the common substance, and every interaction (well, maybe every 10-15 interactions) should affect the probability of further interactions. It would be great if it were possible to simulate the behavior of all the particles simultaneously, and that’s why I want to find out whether it can be done using CUDA.

Unless the calculations in the state transitions are very long, this sounds like it would run on a CPU in seconds, and unless you really need it to run in a fraction of a second there is no need to use a GPU.

1000 particles is tiny; at one thread per particle it could be done with only 2 blocks.

3 questions

  1. Am I right in thinking that it is essential that all particles complete a time step before any start the next time step? If so, then inter-block communication will be a significant design factor.

  2. Are roughly 25% of the particles in each state, or is there a huge bias towards certain states?

  3. What NVIDIA GPU do you have?

Cheers,

  1. Yes, or at least they should be synchronized about every 10-20 steps, because I have to update the probability of the particles interacting with the common substance. Maybe it could be done less often; that depends on the number of particles (it should be somewhere around 10-100K) and on the rate of interaction with the common substance, which in turn depends on quite a bunch of parameters. But it definitely has to be done.

  2. Most likely the majority of particles will stay in one state. They all start in state 1, and after a while most of them will be distributed between several states (for this example, let it be the 3rd and 4th states).

  3. I have an 8600 GT, but soon I’ll have access to a Tesla GPU.

Also, I’ve been thinking about algorithms that would suit the GPU architecture… Since GPUs are good at working with arrays of data, I thought the following:

All the particles can be represented as an array of states (like [1, 3, 1, 2, 4, …, 1]).
For each state only certain transitions are possible, so I can make an array of transitions for each state (like [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, -1, 0, …, 0]), where I divide the probability of each transition by the number of particles and refill this array every n steps (e.g. n = 100). After those n steps I update the state of the common substance and the probability of its interaction with the particles. Then, depending on the state of a particle, the program picks the right transition array and adds the value of the current cell to the particle's cell in the state array. So 0 means the particle's state does not change, 1 means it moves to the next state (e.g. a 2->3 transition), and -2 means e.g. a 4->2 transition. That does mean I’ll have to make quite big transition arrays, and maybe I should pick the cells not sequentially but at index i*n + k (where i is the particle's index in the array, n is the number of steps between synchronizations, and k is the current step). That way the device operates on arrays of data… Maybe that will work better?
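
A CPU sketch of this delta-table idea might look like the following. The table contents and slot counts are invented placeholders; the point is that the per-particle update becomes a uniform table lookup plus an add, with no data-dependent branching:

```cpp
#include <random>
#include <vector>

// For each state, precompute a table of signed state deltas whose entry
// frequencies encode the transition probabilities. A step is then just
// state += table[state][random index]. One slot = 1% probability here.
const int TABLE = 100;

std::vector<std::vector<int>> buildDeltaTables() {
    std::vector<std::vector<int>> t(5, std::vector<int>(TABLE, 0));
    for (int k = 0; k < 10; ++k) t[1][k] = +1;   // state 1: 10% chance of 1 -> 2
    for (int k = 0; k < 5;  ++k) t[2][k] = -1;   // state 2:  5% chance of 2 -> 1
    for (int k = 5; k < 15; ++k) t[2][k] = +1;   // state 2: 10% chance of 2 -> 3
    for (int k = 0; k < 10; ++k) t[3][k] = +1;   // state 3: 10% chance of 3 -> 4
    for (int k = 0; k < 10; ++k) t[4][k] = -1;   // state 4: 10% chance of 4 -> 3
    return t;
}

// Every particle performs the same two operations (lookup, add), which is
// the kind of uniform work that maps well onto a warp executing in lockstep.
void stepAll(std::vector<int>& states,
             const std::vector<std::vector<int>>& deltas,
             std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, TABLE - 1);
    for (int& s : states) s += deltas[s][pick(rng)];
}
```

Refilling the tables every n steps (to reflect the updated substance state) only changes the slot counts, not the update code.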

Thanks,

The 10-100k particle count is very important, as it makes a GPU-based approach much more favourable than it would be with fewer particles.

There are always several ways to do something. The simple approach is one thread per particle, with probably 256 or 512 threads per block. Each kernel call does, say, 10 time steps, and you put the kernel call in a loop with a synchronization call (e.g. cudaThreadSynchronize()) in the loop as well. That implicitly keeps all threads in the same set of 10 time steps (even if you had millions of threads). I think you will need atomic operations to update the ‘substance’. Global atomic operations are costly (a GPU can do about 2 million per second, http://forums.nvidia.com/index.php?showtopic=159217 ). I think you can reduce that cost by copying the substance into shared memory before each block runs its first step, using shared-memory atomic operations on that copy, and then using a single global atomic operation to update the master copy of the substance in global memory.
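
A CPU analogue of that shared-memory trick, using std::thread and std::atomic, is sketched below (all names are hypothetical). Each "block" of particles accumulates its interactions into a private local counter and commits it with a single atomic add; on the GPU, the private counter would live in `__shared__` memory and the commit would be one global `atomicAdd()` per block:

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Global "substance" counter, standing in for the value in GPU global memory.
std::atomic<long> gSubstance{0};

// One "block": accumulate locally (cheap, non-atomic), then commit once.
void processBlock(const std::vector<int>& interactions, int begin, int end) {
    long local = 0;                   // stand-in for a shared-memory accumulator
    for (int i = begin; i < end; ++i)
        local += interactions[i];     // cheap non-atomic updates
    gSubstance.fetch_add(local);      // one atomic per block, not one per particle
}

// Splits the particle array into blocks and runs them concurrently.
long countInteractions(const std::vector<int>& interactions, int blockSize) {
    gSubstance = 0;
    std::vector<std::thread> pool;
    for (int b = 0; b < (int)interactions.size(); b += blockSize) {
        int end = std::min<int>(b + blockSize, (int)interactions.size());
        pool.emplace_back(processBlock, std::cref(interactions), b, end);
    }
    for (auto& t : pool) t.join();
    return gSubstance.load();
}
```

The saving is the same in both settings: the contended atomic is touched once per block instead of once per particle per step.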

However, 100k particles times 1000 time steps times, say, 100 operations per particle per time step is 10^10 operations, so a CPU-based approach could probably do the whole task in a few seconds. (The CPU has the advantage that the ‘substance’ will probably stay in a register, making it cheap to update.)

Thank you! I’ll have to study that carefully… Maybe you can suggest some example code that is close to this task?

P.S. As I said before, the example I gave was simplified. Actually I’d like to reach milliseconds of simulated time with a step of about 0.5 - 1 ps, so it will be about a billion steps. Depending on the parameters, it takes several hours to compute this on a CPU.

The key point here is whether the particles can be processed in parallel or not, i.e. whether p1’s behaviour depends on p0’s change. The “Life” game could easily be ported to a GPU, btw.

Only indirectly, via the common substance. But the state of the common substance changes quite slowly, so it can be updated every 100-200 steps, and for those 100-200 steps the particles can be considered independent. They certainly can be processed in parallel. But I’ve read that a multiprocessor can apply only one operation to all its threads at a time, so there may be difficulties if the threads have branches. And the particles should change their state randomly. So the question is whether it is possible to do this on a GPU.

Branches are less efficient, but most high end GPUs have quite a bit of instruction throughput to spare, so moderate branching is not a total showstopper. The hardware is also pretty good about recombining divergent warps after branches have finished. Random memory access tends to be a far more serious performance killer, unless your working set of random-access data fits in the L2 cache of a Fermi-based GPU.