Need advice on using CUDA for a certain task

I’m not very experienced at programming, but I’m capable of writing some simple code for simple tasks. Recently I came across a task: simulating the interaction of many small particles with one common substance. I’ve read a bit about CUDA before, and I thought it might work here. I’ll describe the task as briefly as I can.

We have:
~1000 small particles
1 common substance

These small particles can be in 4 different states, and they can change their state with some probability at every time step. All small particles start in state 1 and can interact with the common substance only when in state 4. After interacting they go to state 3.
Here’s a scheme of the possible transitions: [diagram not reproduced]

The simulation starts at step 1 and runs for e.g. 1000 steps. At each step, for each small particle, the program has to decide whether it makes a transition to another state or interacts with the common substance (generate a random number and check whether it falls in a certain interval).
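
The "generate a random number and check which interval it hits" step could look like the following C++ sketch for one decision; `cum` holds cumulative probabilities of the possible outcomes, and all names and values here are illustrative placeholders, not anything from the actual simulation:

```cpp
#include <random>
#include <vector>

// Returns the index of the cumulative-probability interval that a uniform
// draw in [0,1) lands in. For example, cum = {0.05, 0.15, 1.0} encodes three
// outcomes with probabilities 5%, 10%, and 85%.
int pickOutcome(const std::vector<double>& cum, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double u = uni(rng);
    for (size_t k = 0; k < cum.size(); ++k)
        if (u < cum[k]) return (int)k;
    return (int)cum.size() - 1; // guard against rounding at the top end
}
```

Each state would have its own `cum` table listing its allowed transitions plus a final "stay put" interval.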

I read that one multiprocessor can execute only one operation across all of its threads at a time. So the question is: can I program several multiprocessors, each for a specific operation? For example, one for the 1->2 transition, another for 2->1, a third for 2->3, etc. Is it possible to do something like this on a GPU? Or maybe there’s some other way to do it. I’m willing to hear any advice. Thanks!

If the numbers you have given (~1000 particles and ~1000 time steps) are correct, I don’t think you actually need to dive into CUDA for this one, unless you have extreme timing requirements. A basic approach in the language you are comfortable with should be able to run this simulation in a very short time.
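
To make that concrete, here is a minimal CPU version of the simulation as described in the question. The transition probabilities are invented placeholders (the thread never gives the real ones); only the state machine's shape follows the post:

```cpp
#include <array>
#include <random>
#include <vector>

// Runs nParticles through nSteps of the 4-state Monte Carlo process and
// returns the final count of particles in each state (indices 1..4).
// All probabilities below are made-up example values.
std::array<int, 5> runSim(int nParticles, int nSteps, unsigned seed) {
    std::vector<int> state(nParticles, 1);   // all particles start in state 1
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    for (int t = 0; t < nSteps; ++t) {
        for (int i = 0; i < nParticles; ++i) {
            double u = uni(rng);
            switch (state[i]) {
                case 1: if (u < 0.10) state[i] = 2; break;   // 1 -> 2
                case 2: if (u < 0.05) state[i] = 1;          // 2 -> 1
                        else if (u < 0.15) state[i] = 3;     // 2 -> 3
                        break;
                case 3: if (u < 0.10) state[i] = 4; break;   // 3 -> 4
                case 4: if (u < 0.10) state[i] = 3; break;   // interact, then 4 -> 3
            }
        }
    }
    std::array<int, 5> counts{};
    for (int s : state) ++counts[s];
    return counts;
}
```

For ~1000 particles and ~1000 steps this is only about a million state updates, which finishes in milliseconds on any modern CPU.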

The example was simplified just to keep the description of the main aspects short ) But time is not the only problem.
Interactions change the state of the common substance, and every interaction (well, maybe every 10-15 interactions) should affect the probability of further interactions. It would be great if it were possible to simulate the behavior of all the particles simultaneously, and that’s why I want to find out whether it can be done using CUDA.

Unless the calculations in the state transitions are very long, this sounds like it would run on a CPU in seconds, and unless you really need it to run in a fraction of a second there is no need to use a GPU.

1000 particles is tiny; at one thread per particle it could be done with only 2 blocks.

3 questions

  1. Am I right in thinking that it is essential that all particles complete a time step before any start the next time step? If so, then inter-block communication will be a significant design factor.

  2. Are roughly 25% of the particles in each state, or is there a huge bias towards certain states?

  3. What NVIDIA GPU do you have?

Cheers,

  1. Yes, or at least they should be synchronized about every 10-20 steps, because I have to update the probability of the particles interacting with the common substance. Maybe it could be done less often; that depends on the number of particles (it should be somewhere around 10-100K) and on the rate of interaction with the common substance, which in turn depends on quite a bunch of parameters. But it definitely has to be done.

  2. Most likely the majority of particles will stay in one state. They all start in state 1, and after a while most of them will be distributed between several states (for this example, let it be the 3rd and 4th states).

  3. I have an 8600 GT, but soon I’ll have access to a Tesla GPU.

Also, I’ve been thinking about algorithms that would suit the GPU architecture… Since GPUs are good at working with arrays of data, I thought the following:

All the particles can be represented as an array of states (like [1, 3, 1, 2, 4, …, 1]).
For each state only certain transitions are possible, so I can make an array of transitions for each state (like [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, -1, 0, …, 0]), where I divide the probability of each transition by the number of particles and refill this array every n steps (e.g. n = 100). After those n steps I update the state of the common substance and the probability of its interaction with the particles. Then, depending on the state of a particle, the program picks the right transition array and adds the value of the current cell to the particle's cell in the state array. So 0 means the particle's state does not change, 1 means it moves to the next state (e.g. a 2->3 transition), and -2 means e.g. a 4->2 transition. That does mean I’ll have to make quite big transition arrays, and maybe I should pick the cells not sequentially but at index i*n + k (where i is the particle's index in the array, n is the number of steps between synchronizations, and k is the current step). That way the device operates on arrays of data… Maybe that will work better?
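
A CPU sketch of this delta-table idea might look like the following. The table contents and slot counts are invented placeholders; the point is that the per-particle update becomes a uniform table lookup plus an add, with no data-dependent branching:

```cpp
#include <random>
#include <vector>

// For each state, precompute a table of signed state deltas whose entry
// frequencies encode the transition probabilities. A step is then just
// state += table[state][random index]. One slot = 1% probability here.
const int TABLE = 100;

std::vector<std::vector<int>> buildDeltaTables() {
    std::vector<std::vector<int>> t(5, std::vector<int>(TABLE, 0));
    for (int k = 0; k < 10; ++k) t[1][k] = +1;   // state 1: 10% chance of 1 -> 2
    for (int k = 0; k < 5;  ++k) t[2][k] = -1;   // state 2:  5% chance of 2 -> 1
    for (int k = 5; k < 15; ++k) t[2][k] = +1;   // state 2: 10% chance of 2 -> 3
    for (int k = 0; k < 10; ++k) t[3][k] = +1;   // state 3: 10% chance of 3 -> 4
    for (int k = 0; k < 10; ++k) t[4][k] = -1;   // state 4: 10% chance of 4 -> 3
    return t;
}

// Every particle performs the same two operations (lookup, add), which is
// the kind of uniform work that maps well onto a warp executing in lockstep.
void stepAll(std::vector<int>& states,
             const std::vector<std::vector<int>>& deltas,
             std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, TABLE - 1);
    for (int& s : states) s += deltas[s][pick(rng)];
}
```

Refilling the tables every n steps (to reflect the updated substance state) only changes the slot counts, not the update code.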

Thanks,

The 10-100k particle count is very important, as it makes a GPU-based approach much more favourable than it would be with fewer particles.

There are always several ways to do something. The simple approach is one thread per particle, with probably 256 or 512 threads per block. Each kernel call does, say, 10 time steps, and you put the kernel call in a loop with a synchronization call (e.g. cudaThreadSynchronize()) in the loop as well. That implicitly keeps all threads in the same set of 10 time steps (even if you had millions of threads). I think you will need atomic operations to update the ‘substance’. Global atomic operations are costly (a GPU can do about 2 million per second, http://forums.nvidia.com/index.php?showtopic=159217 ). I think you can reduce that cost by copying the substance into shared memory before each block runs its first step, using shared-memory atomic operations on that copy, and then using a single global atomic operation to update the master copy of the substance in global memory.
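
A CPU analogue of that shared-memory trick, using std::thread and std::atomic, is sketched below (all names are hypothetical). Each "block" of particles accumulates its interactions into a private local counter and commits it with a single atomic add; on the GPU, the private counter would live in `__shared__` memory and the commit would be one global `atomicAdd()` per block:

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Global "substance" counter, standing in for the value in GPU global memory.
std::atomic<long> gSubstance{0};

// One "block": accumulate locally (cheap, non-atomic), then commit once.
void processBlock(const std::vector<int>& interactions, int begin, int end) {
    long local = 0;                   // stand-in for a shared-memory accumulator
    for (int i = begin; i < end; ++i)
        local += interactions[i];     // cheap non-atomic updates
    gSubstance.fetch_add(local);      // one atomic per block, not one per particle
}

// Splits the particle array into blocks and runs them concurrently.
long countInteractions(const std::vector<int>& interactions, int blockSize) {
    gSubstance = 0;
    std::vector<std::thread> pool;
    for (int b = 0; b < (int)interactions.size(); b += blockSize) {
        int end = std::min<int>(b + blockSize, (int)interactions.size());
        pool.emplace_back(processBlock, std::cref(interactions), b, end);
    }
    for (auto& t : pool) t.join();
    return gSubstance.load();
}
```

The saving is the same in both settings: the contended atomic is touched once per block instead of once per particle per step.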

However, 100k particles times 1000 time steps times, say, 100 operations per particle per time step is 10^10 operations, so a CPU-based approach could probably do the whole task in a few seconds. (The CPU has the advantage that the ‘substance’ will probably stay in a register, making it cheap to update.)

Thank you! I’ll have to study that carefully… Maybe you can suggest some example code that is close to this task?

P.S. As I said before, the example I gave was simplified. Actually I’d like to reach milliseconds of simulated time with a step of about 0.5 - 1 ps, so it will be about a billion steps. Depending on the parameters, it takes several hours to compute this on a CPU.

The key point here is whether the particles can be processed in parallel or not, i.e. whether p1’s behaviour depends on p0’s change. The “Life” game could easily be ported to a GPU, btw.

Only indirectly, via the common substance. But the state of the common substance changes quite slowly, so it can be updated every 100-200 steps, and for those 100-200 steps the particles can be considered independent. They certainly can be processed in parallel. But I’ve read that a multiprocessor can apply only one operation to all its threads at a time, so there may be difficulties if the threads have branches. And the particles should change their state randomly. So the question is whether it is possible to do this on a GPU.

Branches are less efficient, but most high end GPUs have quite a bit of instruction throughput to spare, so moderate branching is not a total showstopper. The hardware is also pretty good about recombining divergent warps after branches have finished. Random memory access tends to be a far more serious performance killer, unless your working set of random-access data fits in the L2 cache of a Fermi-based GPU.