CUDA & Smoothed Particle Hydrodynamics Best approach?

Hi,

I’ve had some problems getting CUDA up and running on my laptop, but I’ve ordered a new desktop which should arrive next week and which (hopefully!) will allow me to compile and run programs without wacky driver problems. I’d like to hit the ground running in terms of how I convert my current code to CUDA in order to get the most efficient results.

The specific thing I’m implementing is particle-based viscoelastic fluid simulation(PDF), but if it helps make it easier to do the pseudocode for every frame it goes something like this:

for(each particle)

{

	BuildNeighboursListForParticle(); // See note 1

	for(each particle in this particle's neighbour list)

	{

		ComputeDensityContribution(); // See note 2

	}

	for(each particle in this particle's neighbour list)

	{

		ApplyForcesToThisParticleAndOtherParticle(); // See note 3

	}

}

for(each particle)

{

	UpdatePositionsAndVelocitiesBasedOnAccumulatedForces(); // See note 4

}

Note 1: The world is divided into grid cells, and the neighbours list for each particle consists of all of the particles in the same grid cell as the particle, plus all of the particles in the surrounding 8 grid cells (my simulation is 2-dimensional). When particles are moved, if they’ve changed grid cells, they remove themelves from the old cell, and add themselves to the new cell. This step simply builds a temporary linked-list type structure by hooking the end pointers of some cells to the start pointers of the next cell, for the 9 cells we’re interested in.

Note 2: This step performs calculates which add the density of nearby particles to the one we’re considering. It could probably also be adapted to add this particle’s density to the other particle as well, to do all of the work for particle pair AB and BA only once, although I’ve not found a stable way to do this yet.

Note 3: If “this” particle is A, and the “other” is B, this function deals with BA at the same time it deals with AB, so only computes each particle pair once. Basically the result is that in the final step every particle has stored delta values for how much their position needs to change this frame.

Note 4: This just iterates through all of the particles and applies the position deltas to the actual position, and calculates the new velocity using Verlet integration.

I hope that that makes sense… My question is, how can this best be converted to CUDA? My hunch is that for steps 1, 2 and 3, each particle can be given its own thread to calculate its eventual delta for this frame, because it seems like all of these things can be made to be calculated independently and simultaneously for each particle, but it seems like that might be a lot of work for each thread, and I’m not sure of the best way to manage the memory.

I’m new to CUDA, and I’m not even sure how to ask the right questions… I guess what I’m asking is this: Given that I have an understanding of SPH (and the Viscoelastic approach in particular), what would be the most efficient way to convert this algorithm to run under CUDA, taking the number and complexity of threads, and the memory management into account? How easy would it be to port this approach to OpenCL when that becomes available (as an indie bedroom developer I don’t imagine NVidia would have much interest in allowing me into the OpenCL beta…)?

Just look at the particles demo in the SDK. It implements a mass-spring system in a uniform grid using a sorting each timestep. There’s also some documentation about how it works. Once you understand it, you can easily modify it to do SPH. I know I have. Though it is 3D, I think you could still use it to make a 2D version too.

First a few general comments: All of your steps appear at first glance to break down into one thread per particle very nicely. Also, in general, with the data-parallel model it is difficult if not impossible to have the AB interaction to write the BA interaction out to memory. Think about it: thread A is calculating the interaction on A and accumulating the total and as it goes it has to write out a value to location B in memory. But you might have 200 other threads also trying to write to location B in memory. The race conditions are atrocious. Secondly, if you actually count the number of memory accesses, it is higher when you have the AB write out the BA interaction (and memory bandwidth will certainly be the bottleneck). Thirdly, ignoring those issues, the memory writes from the A thread to the B threads are going to be scattered, meaning not coalesced and therefore slow by a factor of ~20 compared to coalesced writes.

Basically, it is almost always better in CUDA to have the A thread calculate the AB interaction and the B thread calculate the BA interaction. Don’t feel bad about the wasted FLOPs, the GPU has plenty to spare when the memory bandwidth is the limitation.

And if it didn’t come across above, the way you lay the data out in memory is the most important step in getting good performance. It’s all about getting the most memory bandwidth from the card that you can and that means coalescing reads and writes and/or making good use of the texture cache. So choose your data structures wisely first and then build the calculation kernels around them (see the paper I link below for details).

For example, your linked-list of cells is just not going to work out very well on the GPU. A wasteful grid that stores particles and empty slots works much better.

You are asking very good questions.

Building a neighbor list efficiently (O(N)) is by far the most complicated thing I have ever done in CUDA. And I’ve been writing CUDA code for a long time. Getting the memory management right and breaking it up into threads right is all a very complicated game. Fortunately, people like me have already figured it out :) I do molecular dynamics, which is very similar to SPH when it comes down to it. You basically just have different forces. Here is the paper we published on the topic: http://www.ameslab.gov/hoomd/downloads/pub…md_preprint.pdf

Note that the neighborlist algorithm in that paper is no longer in use. I’ve got a different way of breaking up the threads/blocks now that is 50% faster. The code is open source and you can get it from http://www.ameslab.gov/hoomd . And I’d be happy to discuss the details with you.

However, the neighborlist generation is such a complicated beast, you definitely do not want to start there if you are new to CUDA. I would highly suggest starting out by calculating the neighbor list on the CPU, copy it to the GPU and then learn CUDA by implementing the other, simpler steps. Once all that is done and you have great speedups for each of the individual easy steps, then come back and do the neighbor list. It worked for me :)

Converting the actual computation kernels from CUDA to OpenCL will be trivial. The OpenCL spec reads just like the CUDA one, except they changed all the names. Converting the code that does memory allocation and launches the kernels will be a tedious pain in the butt, but doable. CUDA has a huge amount of convenience built into the API that OpenCL doesn’t even try to emulate. Basically, in CUDA, allocating memory and calling a kernel can be done in 2 lines of code. In OpenCL, it takes 100+ (see the example code at the end of the OpenCL specification document).

If you haven’t already, you might want to look into Matthias Muller’s publications, here http://www.matthiasmueller.info/

Muller is the guy who did SPH for CUDA-accelerated PhysX game engine.

Regrettably, he doesn’t seem to have detailed papers on CUDA implementation on his webpage (I suppose he might be NDA’d a bit with this PhysX thing).

I’ve been thinking about doing an SPH solver in CUDA but couldn’t come up with a smart way of creating neighbourhood lists. And I figured n^2 computation, without nb lists, is A Bad Thing even on GPUs.

FYI, the CUDA implementation of PhysX SPH fluids uses code very similar to the particles sample in the SDK for neighbour finding (with a few additional optimizations).

Any chance on some hints as to what those optimizations could be?

Thanks for all the advice so far. In particular, the particles sample seems like it will be a useful starting point (I’d second Tovi’s request to find out what further optimisations exist in the PhysX implementation, out of sheer curiosity). MisterAnderson42, you did a great job at explaining the sort of broad approach I should be going for. I still don’t have my new computer to compile code samples with (annoyingly, I can run CUDA samples on my current machine, but I can’t compile them, so I have to treat the code as read-only for now :angry: ), but the papers are giving me food for thought. I imagine I’ll have lots more questions once I can start compiling and playing with code, but in the meantime I just wanted to register my gratitude for the replies.

Actually Simon Schirm is the fluid architect. But of course it’s all a team.

Some more details about PhysX SPH implementation:

http://sa08.idav.ucdavis.edu/CUDA_physx_fluids.Harris.pdf

Tim (tmurray) has said before to just fill in whatever parts of the registered developer app apply to you. You have to be willing to comply with the NDA though. I got an account, and I’m only an undergrad student.

Woot! I finally got some CUDA code compiling and running!

I’m looking at the particles sample, and it seems mostly straightforward. I’ve got it nominally working in 2D in preparation to convert it all to do some nice fluidy SPH, but I’m curious about how the position and velocity vectors are represented. All of the positions and velocities appear to be stored and represented as 4 floats (x,y,z,w), even though when they’re integrated they’re converted down into float3s. Is there a reason for this? I don’t see any particular matrix operations performed on those vectors that would be helped if they were 4D vectors, so I assume it’s some kind of memory alignment thing? If I were to rework the program to represent particle positions and velocities as just 2D (x,y) vectors, would I run into problems?

float4 reads and writes can be coalesced easily. float3’s cannot. That makes for a ~20x performance boost even though you write 25% more data (unused)

With float2’s you will be fine.

How often you have to update the neighbor list?

Is there any possibility to do so on the CPU?

That depends entirely on the system being simulated. In simulations of liquids, it is typically every 10 time steps or so. But a single neighbor list computation is 4-5 times more expensive than the per-timestep calculations.

Well of course, there is the possibility but it slows performance exceedingly. Consider it: The neighbor list build requires 30% of the computation time when everything is done on the CPU. If you leave the neighbor list on the CPU and optimize everything else in CUDA, then your maximum speedup for the entire application is only 3.3! That is pathetic. To get speedups of 64 for the entire application, the neighbor list has to be done on the GPU.

That sounds sad. I do not want to do this on the GPU. As you noted it is a complicated task.

Have you reached the 64x? Congratulations. My record is 30x at Tesla 1060.

On a related note, what’s the best way to lay my data out in general for this kind of computation? The particles example has an array of position vectors (float4s in the example, float2s in my converted code), and another array of velocity vectors. Are those arrays best kept seperate, or is it more efficient to have just one array of vectors, with position and velocity interleaved, or (presumably equivalently) an array of structs like:

struct sParticle {

	float2 pos;

	float2 vel;

};

Apologies if my questions are a bit newbie-ish, it’s because I’m new at this :"> I’m not looking for absolutely blistering performance, particularly, but I am trying to avoid making too many daft mistakes.

EDIT: I’m also getting myself quite confused about how best to arrange the threads, and what to turn into kernels. In the Particles sample it seems quite straightforward because the kernels are things you need to do for each particle, which seem to work quite well in parallel - Each particle can do its integration, hash calculation, neighbours list building via the Radix sort, and then collision. The collision is generally going to involve quite a small number of neighbours because the grid size is set to the diameter of the particles, so (if I’ve got this right), a thread gets spawned for each particle, but each of those particles does not spawn another thread specifically for processing the interaction between itself and each individual neighbour on its list, it just iterates through them on one thread. Am I reading that right?

Anyway, SPH obviously involves grid cells which are bigger than the particle’s collision diameter, because in SPH the particles have an effect on each other even when they’re not colliding, so potentially we’re dealing with larger neighbours lists, and I wondered if that would influence the overall layout of the code in any way. I suppose what I’m asking is whether somebody might be able to take another quick look at the pseudocode I posted at the start of the thread, and indicate which things should be turned into CUDA kernels, which parts (if any) should still be done in C++ on the GPU, and whether I need to refactor parts in order to make that happen (shifting loops around, or the order things are done in, or whatever).

It shouldn’t make much difference whether you do two loads of float2 or one load of float4. CUDA can coalesce memory operations on structures that are 32bit, 64bit and 128bit (float, float2, float4 and equivalents, like your 128bit sParticle). On my card there’s less than 4% difference in bandwidth between both cases. I’d say use whichever makes more sense to you. Just don’t use float3s.

Step 2, 3 and 4 should be done on the GPU and they’ll be pretty easy if you find a nice way to go over a particle’s neighbours. Step 1 would also be best done on GPU to avoid PCI-E memcpys but that’s the tricky part. Sorting, making neighbours lists and then going through them in subsequent steps while maintaining coalescing as often as possible. Force calculations, integration and such are easy.
Did I say “lists”? Forget about linked lists, you don’t want to go into pointer chasing on the GPU.

Perhaps I phrased my question about the neighbours lists badly - I’ll try again.

I’m aware that steps 2, 3 and 4 of my code should be performed on the GPU, and that ideally step 1 should be too, if there’s a way I can work that out. I’m also aware that my linked-list based approach for neighbours list building isn’t going to be of much use: To begin with at least, I’m looking into whether I can do something very similar to the sort-based approach in the Particles SDK sample, with changes being made to reflect the fact that a larger grid cell size means that there can be more than 4 particles in a given cell.

What I haven’t worked out, and what I was asking, is which parts of my code should be kernels. I’m confused because with my code as it currently stands I can’t see the best way to do that. One approach could be to write two kernels as replacements for the two “for(each particle)” loops. Certainly the second one (containing step 4, which is basically just integration) seems sensible, but it would mean that the first kernel would be responsible for building a particle’s neighbour list, then iterating through it twice, once to compute densities and then again to apply forces. It seems like a lot of work for one kernel, and involves iterating through neighbours lists containing potentially 150 or more particles - I believe the Particles SDK demo doesn’t ever consider more than 32 particles, which is why I wondered whether there’s a way to parallelise these neighbour interactions, or whether processing neighbours lists iteratively was the only way. Another approach might be to write a kernel for step 2, and another for 3 (both of which deal with the interactions between the “current” particle and a specific one on their neighbours list), but iterate through each “current” particle on the CPU as I do now, but that sounds to me like a very inefficient way of using CUDA. Or is there a way for kernels to invoke other kernels, and if there is, is it sensible to have them do so in a situation like this?

Sure, the sort based method works fine for any number of particles in each cell. You just need a fancier data structure than the SDK sample uses so you can store more than 4 in a cell. I discuss such a data structure in that paper I linked to way back at the top of this thread…

That does sound like a lot of work for one kernel. I would break it up.

Well, depending on the interaction you could set it up so a block processes each particle. Each thread in the block would handle a single neighbor of that particle and your results would be added up in a parallel reduction. In my own experimentation with such a method, it turned out to be slower even for systems with ~100 neighbors per particle, but your mileage may vary.

This is what I’ve been struggling with. From the discussions so far, my original pseudocode might look something like this, so far:

for(each particle)

{

	// Kernel 1

	// Done using some variation on the sort-based methods we've been discussing

	BuildNeighboursListForParticle();

}

for(each particle)

{

	// Kernel 2

	for(each particle in this particle's neighbour list)

		// Accumulates all of the density contributions from each neighbouring particle, and also stores "q" for each particle pair

		ComputeDensityContribution();

	for(each particle in this particle's neighbour list)

		// Calculates pressures based on the previously-calculated density and "q" values, resulting in an impulse to be applied to the particle

		ApplyForcesToThisParticle(); 

}

for(each particle)

{

	// Kernel 3

	// Apply the impulse calculated in kernel 2, and perform integration

	UpdatePositionsAndVelocitiesBasedOnAccumulatedForces();

}

Kernel 2 being the large one which might need to be broken up, iterating as it does through a particle’s neighbour list twice, and performing various other branching operations. Kernel 2 could be further broken up like this, perhaps

for(each particle)

{

	// Kernel 2A

	for(each particle in this particle's neighbour list)

		// Accumulates all of the density contributions from each neighbouring particle, and also stores "q" for each particle pair

		ComputeDensityContribution();

}

for(each particle)

{

	// Kernel 2B

	for(each particle in this particle's neighbour list)

		// Calculates pressures based on the previously-calculated density and "q" values, resulting in an impulse to be applied to the particle

		ApplyForcesToThisParticle(); 

}

But to do so would involve either massively increasing the storage needed for caching values of q, or not storing it at all and recalculating the value a second time in kernel 2B. With the caching approach, currently that’s one float per possible neighbour of any given particle - assuming that under circumstances of extreme compression a particle might have (for example) up to 500 neighbours, thats 500 *sizeof(float) = ~1.9K - but if I were to split the kernels out in this way it would be the number of particles (say 32768) * 500 * sizeof(float) = 65.5 megabytes… With the recalculation approach, I’d be duplicating a couple of adds, subtracts, multiplies, two if statements and possibly/probably a square root. What would conventional CUDA logic dictate to be the sanest approach here?

Apologies if this seems like a stupid question, by the way - storing buffers that big seems crazy to me, but then so does performing twice as many square roots as is necessary. I’ve not been able to track down anything which gives me a solid idea of what a sensible upper limit should be for the size or complexity of any given kernel.

I would worry less about the storage requirements and more about the time it would take to re-read in all those values from global memory. At 100 GiB/s, reading and writing 65.5 MiB costs you an additional 1.2milliseconds. That is pretty steep penalty when you don’t have to do it.

This is a very good question. And illustrates very well the compute vs memory bandwidth capabilities of the GPU. Modern GPUs can hit 1 TFLOP, but only under idea circumstances. Taking a much more modest estimate that you could obtain ~300 GFLOPs doing these operations you’ve got: 32768 * 500 neighbors * 5 FLOP/particle / 300 GFLOPS = 0.27 milliseconds. I hope the answer is clear :) And as I said it would, this illustrates “conventional CUDA logic” of compute vs memory bandwidth: The balance between FLOPS and GiB/s on the GPU is such that you can easily “waste” 10-20 (or maybe even more) FLOP calculations and still save time vs reading in a pre-calculated value from global memory.

Honestly, even after all these checks for the best way to split the kernel, I think your best bet is to do both neighbor loops in one kernel. Then you avoid both the wasted global memory bandwidth and the “wasted” FLOPs. That, and the two loops are going to be very similar right? That will make it easy to combine them. A loop to build the neighbor list is structured a lot differently than a loop through the neighbor list, so combining them in the same block/grid configuration could be a significant challenge.

When I suggested the breakup, I was specifically referring to breaking out the neighbor list build from the loops through the neighbor list. Presumably, that also incurs “wasted” memory bandwidth, but if you use the standard trick of adding in a buffer, then one neighbor list build is good for quite a few time steps before it needs to be rebuilt.

And when it really comes down to it, there really is no hard and fast rule to say when kernels must be broken up. As we’ve seen here, in some cases memory bandwidth can be saved by combining the kernels. But balancing against that, a big uber-kernel is a lot harder to program, maintain, test, and debug. Is 300% more programming effort worth ~10% extra performance? It could be… Of course, you can always start with the easier to program smaller kernels and then move on to combining them later if the performance losses seem excessive.