Hi,
I’ve had some problems getting CUDA up and running on my laptop, but I’ve ordered a new desktop which should arrive next week and which (hopefully!) will allow me to compile and run programs without wacky driver problems. I’d like to hit the ground running in terms of how I convert my current code to CUDA in order to get the most efficient results.
The specific thing I’m implementing is particle-based viscoelastic fluid simulation (PDF). In case it helps, the pseudocode for each frame goes something like this:
for(each particle)
{
    BuildNeighboursListForParticle(); // See note 1
    for(each particle in this particle's neighbour list)
    {
        ComputeDensityContribution(); // See note 2
    }
    for(each particle in this particle's neighbour list)
    {
        ApplyForcesToThisParticleAndOtherParticle(); // See note 3
    }
}
for(each particle)
{
    UpdatePositionsAndVelocitiesBasedOnAccumulatedForces(); // See note 4
}
Note 1: The world is divided into grid cells, and the neighbours list for each particle consists of all of the particles in the same grid cell as the particle, plus all of the particles in the surrounding 8 grid cells (my simulation is 2-dimensional). When particles move and change grid cells, they remove themselves from the old cell and add themselves to the new cell. This step simply builds a temporary linked-list type structure by hooking the end pointer of each cell to the start pointer of the next cell, for the 9 cells we’re interested in.
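To make the grid lookup in Note 1 concrete, here is a minimal sketch of mapping a position to a cell and listing the 3x3 block of cells around it. The `Grid` struct, the function names and the clamping at the borders are all assumptions made for illustration; they are not taken from the original code, which uses linked lists per cell.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical grid description: width x height square cells of size
// cellSize, with the origin at the lower-left corner of cell (0, 0).
struct Grid {
    int width;
    int height;
    float cellSize;
};

// Map a position to a flat, row-major cell index. Coordinates are
// clamped so that particles outside the grid land in a border cell.
int cellIndex(const Grid& g, float x, float y) {
    assert(g.width > 0 && g.height > 0 && g.cellSize > 0.0f);
    int cx = static_cast<int>(std::floor(x / g.cellSize));
    int cy = static_cast<int>(std::floor(y / g.cellSize));
    cx = cx < 0 ? 0 : (cx >= g.width ? g.width - 1 : cx);
    cy = cy < 0 ? 0 : (cy >= g.height ? g.height - 1 : cy);
    return cy * g.width + cx;
}

// Return the flat indices of the cell containing (x, y) plus its
// surrounding cells (up to 8), skipping cells outside the grid.
std::vector<int> neighbourCells(const Grid& g, float x, float y) {
    const int c = cellIndex(g, x, y);
    const int cx = c % g.width;
    const int cy = c / g.width;
    std::vector<int> cells;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const int nx = cx + dx;
            const int ny = cy + dy;
            if (nx < 0 || nx >= g.width || ny < 0 || ny >= g.height) continue;
            cells.push_back(ny * g.width + nx);
        }
    }
    return cells;
}
```

Because the lookup is pure index arithmetic with no pointer chasing, the same two functions could in principle be called from each thread of a kernel once the per-cell particle lists are stored in flat arrays.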
Note 2: This step performs the calculations that add the density of nearby particles to the one we’re considering. It could probably also be adapted to add this particle’s density to the other particle as well, so that the work for particle pairs AB and BA is done only once, although I’ve not found a stable way to do this yet.
Note 3: If “this” particle is A, and the “other” is B, this function deals with BA at the same time it deals with AB, so it only computes each particle pair once. The result is that, by the final step, every particle has stored delta values for how much its position needs to change this frame.
Note 4: This just iterates through all of the particles and applies the position deltas to the actual position, and calculates the new velocity using Verlet integration.
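As a rough sketch of the update in Note 4, the loop below applies the stored deltas and then recovers velocity from the position change over the frame. That velocity rule is an assumption (one common position-based form of Verlet), as are the structure-of-arrays layout and all names; the original code may differ in each of these.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays particle storage. Separate arrays per
// component tend to suit GPU memory access better than an array of structs.
struct Particles {
    std::vector<float> x, y;          // current positions
    std::vector<float> prevX, prevY;  // positions at the start of the frame
    std::vector<float> dx, dy;        // position deltas accumulated this frame
    std::vector<float> vx, vy;        // velocities
};

// Apply the accumulated deltas, derive velocity from the position change,
// and reset the per-frame state. Each iteration touches only particle i,
// so on a GPU each iteration could be one thread.
void integrate(Particles& p, float dt) {
    assert(dt > 0.0f);
    const std::size_t n = p.x.size();
    assert(p.y.size() == n && p.prevX.size() == n && p.prevY.size() == n);
    assert(p.dx.size() == n && p.dy.size() == n);
    assert(p.vx.size() == n && p.vy.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.dx[i];
        p.y[i] += p.dy[i];
        p.vx[i] = (p.x[i] - p.prevX[i]) / dt;
        p.vy[i] = (p.y[i] - p.prevY[i]) / dt;
        p.prevX[i] = p.x[i];
        p.prevY[i] = p.y[i];
        p.dx[i] = 0.0f;
        p.dy[i] = 0.0f;
    }
}
```

No iteration reads another particle’s data, so this step needs no synchronisation between particles and is the simplest one to move to a kernel.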
I hope that makes sense… My question is, how can this best be converted to CUDA? My hunch is that for steps 1, 2 and 3, each particle can be given its own thread to calculate its eventual delta for this frame, because it seems like all of these things can be calculated independently and simultaneously for each particle. However, that might be a lot of work for each thread, and I’m not sure of the best way to manage the memory.
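The one-thread-per-particle idea can be sketched on the CPU as a “gather”: each iteration reads its neighbours but writes only its own result, so iterations never conflict. The sketch below does this for density. It is brute force over all particles for brevity (the inner loop would be restricted to the 9 cells from Note 1), the `(1 - r/h)^2` weight is a placeholder assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Work for a single particle i: sum a weight over every other particle
// closer than the interaction radius h. Nothing is written to other
// particles, which is what makes the per-particle work independent.
float densityForParticle(std::size_t i,
                         const std::vector<float>& x,
                         const std::vector<float>& y,
                         float h) {
    float rho = 0.0f;
    for (std::size_t j = 0; j < x.size(); ++j) {
        if (j == i) continue;
        const float ddx = x[j] - x[i];
        const float ddy = y[j] - y[i];
        const float r = std::sqrt(ddx * ddx + ddy * ddy);
        if (r < h) {
            const float q = 1.0f - r / h;  // placeholder weight (assumption)
            rho += q * q;
        }
    }
    return rho;
}

// Serial stand-in for a kernel launch: on a GPU, each value of i would be
// handled by its own thread instead of by this loop.
std::vector<float> computeDensities(const std::vector<float>& x,
                                    const std::vector<float>& y,
                                    float h) {
    assert(x.size() == y.size());
    assert(h > 0.0f);
    std::vector<float> rho(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        rho[i] = densityForParticle(i, x, y, h);
    }
    return rho;
}
```

The trade-off is that a gather computes each pair twice (AB and BA), unlike the symmetric update in Note 3, which writes to both particles and would need atomic operations or some other conflict-avoidance scheme if run in parallel.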
I’m new to CUDA, and I’m not even sure how to ask the right questions… I guess what I’m asking is this: given that I have an understanding of SPH (and the viscoelastic approach in particular), what would be the most efficient way to convert this algorithm to run under CUDA, taking into account the number and complexity of threads and the memory management? And how easy would it be to port this approach to OpenCL when that becomes available (as an indie bedroom developer I don’t imagine NVIDIA would have much interest in allowing me into the OpenCL beta…)?