Examples of solving systems of ordinary differential equations using CUDA?

Hello:
Does anyone know of examples or papers on the use of CUDA for solving systems of ordinary differential equations? All I can find in my search are papers that say some groups have had success, or papers on partial differential equations. I would very much like to find a paper or slide presentation giving an example of how CUDA can be used to solve simple systems of ODEs. Can anyone offer tips, suggestions, references, or web pages?

Thanks, John

hi John,

i’m interested in the same sort of thing, but haven’t been able to find much in my first few days of looking. did you ever find any ODE examples?

cheers,
andrew

hmm… wouldn’t a system of ODEs require synching at every timestep of the solver, wherever there were interdependent terms?

dX/dt = Y + 1/X
dY/dt = Z + 2Y
dZ/dt = -X + Y

just as a for instance. no idea what the dynamics of the above would do, but how could one iterate over every step w/o shared continuous memory?

i'm totally new to both CUDA and C, so my observation may be naive, but it seems that a true "system" of ODEs requires shared memory across the parallelism space.

unless there is some way of putting an implicit "wait till this variable is updated with the correct iteration" condition on the equations, so things only update once they have new info for the latest timestep.

but that’s pretty much just a synch barrier isn’t it?
so again i'm suggesting that an ODE solver would need to synch after every timestep.

at least for one with discrete steps, like Euler's method. a Runge-Kutta solver might be able to do its kind of solving, across multiple sub-steps to get the "next" one, without synching to memory.

guess i’m the only one posting to this thread… but i’m gonna keep doing it :)
maybe some of my brainstorming about how to handle a system of ODEs will be useful to someone else.

my latest thought: use a 2-column table (array) of values to store the X(t) and X(t+1) values

• first column is "old x", second column is "new x", computed as (old x) + (change instruction)*(step size)
• constructing the solver to flip back and forth between the 2 columns would ensure that other variables (threads) always read the correct value of each variable (correct for the timestep we are in) without explicitly specifying which step you are in.
• this might be slower than using a sync barrier that is the same for all threads (variables) in the system of ODEs.
• more than 2 columns might be useful, e.g. if we are solving via Runge-Kutta 4th order, several values could be used in intermediate stages between X(t) and X(t+1).

I don't understand what you mean by atomic. Anyhow, it seems clear to me that you need both an input array and a separate output array, because with in-place modification you can read new values where you would expect old ones.

Your vector X needs to be large, though, to get any CUDA benefit, but I don't know what sizes you people are working with ;)

I don’t remember reading about successful ODEs on the forum, but I don’t need them at this time, so have not really checked.

What kind of parallelism are you looking for?
Parallelism across the system?
Parallelism across the method (ODE solver)?

From your descriptions, I’m guessing parallelism across the system.

yup, i have a somewhat self-contained set of ~20 equations that are largely interdependent, and thus need to be executed together each timestep.

each one of these units is connected to any/all/none of MANY other units (2x10^6)

by connected, i mean that at varied intervals (less often than every timestep) output values from one node will be sent to another node as input.

the equations never change within one node. the parameter values for the equations within each node are set statically (for now), but are "tuned" to produce different "resting" behavior in the dynamics of that node. connection weights between nodes are currently being constructed in a massive matrix (2x10^36), but i may change the problem a bit so i'm only using 128 nodes.

so sorry, long answer, but yes, i DO need parallelism across the system. it seems to me that at every timestep, i'll need to sync the equations within one block, and perhaps also synch data coming in / going out. using atomic operations on a shared output->input matrix would ensure the whole system ran in parallel, but would allow the equations for each unit to be executed in whatever order, or even sequentially, as long as the timestep barrier wasn't crossed.

this will be computationally pretty intensive, since it's not mathematically a "single" step to go from a value of a variable at time t → t+1. a well-constructed solver will be making several intermediate approximations and combining them, which will take more than "one" systematic computation to produce the new values for the timestep. each of these evaluations can thus probably be parallelized across the method, e.g. one (totally independent) mathematical evaluation/approximation can run in an independent thread that only needs to synch at the same timestep barrier as the larger equation system.

i’m still reading through the programmers guide, while building my linux box, but i think this might be a sound (if abstract) way of approaching an ode solver.

thanks for the feedback!

Are you planning to write your own ODE solver, or perhaps use one of the existing ones, like MATLAB's or CVODE? Have you looked at any of these?

You are probably interested in system parallelism rather than method parallelism.

As far as ODE code for a single node with 20 equations is concerned, there is hardly any parallelism in it (method parallelism). At any particular time instance, the amount of work is fairly small, and it is followed by feedback and adjustments. With that kind of code, each multiprocessor in the GPU would not even be fully utilized (each multiprocessor runs a warp of at least 32 threads, regardless of whether your code uses that many or not).

However, if you spread the computation for several different nodes with the same code across several multiprocessors (system parallelism), it would be possible to get a speedup. You could assign the computation for each node to a single thread. Threads for nodes that communicate often after sync stages should be placed in the same blocks (fast access to per-block shared memory; threads from different blocks would have to use slower global memory).

I am also looking for parallelizable ODE code that I could port to the GPU, but I am more interested in method parallelism. I have only one node with a set of 91 differential equations. I ported one of the MATLAB solvers to the GPU. With a time span of 0-5000 ms, I can only get close to a 1.4x speedup compared to the MATLAB version.

I plan to make one soon; it will be a Runge-Kutta-type integrator (8th or 6th order, not sure yet), and I will be applying it to astrodynamics. I have some knowledge of parallel ODE solvers for the CPU, so let's see what I can come up with.

Meanwhile, this thread looks great; it's a nice place to share ideas for this work. I will keep you guys in the loop once I start working on the code (currently I have exams in a week :( :( )

Thanks,

Nittin

Lukasz, thanks - great advice. i'm not too worried about method parallelism (a single instance of the ODE solver for a node), and it seems that my problem might actually be best with 2*10^36 nodes. since my nodes will have connections with other nodes both local (synch often) and long-distance (synch less often), that pretty much defines my block allocations for me :)

i’ve seen the CVODE stuff, but haven’t dug into it.
i’ve never written a solver, just systems to solve :) i should prob look at the Matlab solvers, but i’ve been using Berkeley Madonna for my few-node experiments thus far.

Nittin - that sounds great; please do let me/us know more! :)

I’ve gotten sidetracked with my own exams and papers, so my CUDA is going back-burner for a couple of weeks.

thanks, nice to see some people interested in this. :)
andrew

I have found a great paper with respect to parallelism across the method.

"On a parallel mesh-chopping algorithm for a class of initial value problems using fourth-order explicit Runge-Kutta method" by C. P. Katti

Hosted on Science Direct. I have access through my University.

I'm just having problems evaluating the functions in each thread on the GPU, like with MATLAB's feval() function. Any help in this regard would be appreciated.