MPS has gotten really good, but can CUDA streams replicate the benefits?

I was given pause by what I discovered on a rainy afternoon two days ago. I have spent the past three years developing a code base for running many MD simulations on a card at one time. I still think there is merit in what I am doing. I have devised new methods that both expedite the integration of many systems in a single runtime instance and improve the way each individual system runs. But this has changed the way I think about compute-bound versus memory-bound processes: if there is one of each, interleaving them is absolutely possible, and it will even happen automagically if you just throw enough processes onto the card at one time, even with something like MPS that is agnostic to the nature of each process.

Another way to put it is that, except in the limiting case of raw performance on a single problem, hiding memory latency behind intense computation is a form of pollution: it occupies arithmetic units that could otherwise be used by a process running in parallel, and it throws off heat that limits what other processes can do, or even what the card can sustain in future kernel launches.

It makes me wonder: since I will have explicit control over which systems are running at any one time, should I expect to be able to use CUDA streams to interleave memory-bound and compute-bound work to the same effect as MPS?
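For concreteness, here is a minimal sketch of the kind of single-process interleaving I have in mind (the kernels, sizes, and names below are stand-ins made up for illustration, not anything from my MD code): one arithmetic-heavy kernel and one bandwidth-bound kernel issued on separate streams with no dependency between them, so the hardware scheduler is free to overlap them.

```
#include <cuda_runtime.h>

// Arithmetic-heavy stand-in: lots of FMA work per element, little memory traffic.
__global__ void computeBoundKernel(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    for (int k = 0; k < 512; ++k)
      v = v * 1.000001f + 0.5f;
    x[i] = v;
  }
}

// Bandwidth-bound stand-in: a streaming copy, limited by memory throughput.
__global__ void memoryBoundKernel(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

int main() {
  const int n = 1 << 24;
  float *a, *b, *c;
  cudaMalloc(&a, n * sizeof(float));
  cudaMalloc(&b, n * sizeof(float));
  cudaMalloc(&c, n * sizeof(float));

  cudaStream_t s0, s1;
  cudaStreamCreate(&s0);
  cudaStreamCreate(&s1);

  dim3 block(256);
  dim3 grid((n + block.x - 1) / block.x);

  // No dependency between the two launches, so the device may overlap them.
  computeBoundKernel<<<grid, block, 0, s0>>>(a, n);
  memoryBoundKernel<<<grid, block, 0, s1>>>(b, c, n);

  cudaStreamSynchronize(s0);
  cudaStreamSynchronize(s1);
  cudaStreamDestroy(s0);
  cudaStreamDestroy(s1);
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```

Compiling this with nvcc and looking at the timeline in Nsight Systems would show whether the two kernels actually overlap on a given card, which is probably the first thing I would check before restructuring the real code.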

For reference, here is one of my data sets. What I did was create a single MD system of 23,558 atoms (this is an update of the dihydrofolate reductase “DHFR” benchmark that is well known to molecular dynamics developers). I then ran multiple simulations through multiple processes under MPS with three popular MD codes. Those results are marked with filled circles (or cream-filled donuts, if your Monday needs to slow down). I also tiled the system into single simulations of 2x = 47,116 atoms, 3x = 70,674 atoms, all the way to 15x = 353,370 atoms (and even further, but that gets into other limits in each code base which are outside the scope of this discussion), and ran those as a single process on each card. (Those are the square-decorated lines, shown in relief.) As you can see, MPS outperforms tiling the system and running “multiple” copies of the protein simulation that way almost every time, sometimes by a wide margin. This is pretty easy to rationalize: when there are multiple simulations, the compute-bound and memory-bound processes can interleave, and the MPS scheduler seems to be doing a very good job of this.

I gather that the datapoints marked with circles/donuts are the MPS examples (you said that already) and the datapoints with squares are the “tiled” non-MPS variants?

When I am teaching CUDA, I sometimes describe MPS as a way to allow kernels (or work, if you prefer) launched from separate processes to behave as if they were launched from a single process. I describe it that way because I consider that to be a good thing.

Going back to kernels, if your “tiled” realization is launching the same kernels, with the same data, as the MPS variant, and the MPS variant is simply splitting that work across multiple processes to issue, then I would be surprised to see such a difference, unless you have introduced unexpected dependencies via streams that are “clearly” not present in the MPS case.
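To make the “unexpected dependencies” point concrete, here is a hypothetical sketch (not your benchmark code, and the kernel is a stand-in): issuing every tile into one shared stream serializes the launches, while giving each tile its own stream leaves them independent, which is closer to what separate MPS client processes present to the GPU.

```
#include <cuda_runtime.h>
#include <vector>

// Stand-in for whatever per-tile work the real code does.
__global__ void tileKernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 0.999f + 1.0f;
}

int main() {
  const int numTiles = 8;
  const int n = 1 << 20;
  std::vector<float *> tileData(numTiles);
  for (auto &p : tileData) cudaMalloc(&p, n * sizeof(float));
  dim3 block(256);
  dim3 grid((n + block.x - 1) / block.x);

  // Pattern 1: every tile goes into one shared stream, so each launch
  // implicitly waits for the previous one and the tiles run back to back.
  cudaStream_t shared;
  cudaStreamCreate(&shared);
  for (int t = 0; t < numTiles; ++t)
    tileKernel<<<grid, block, 0, shared>>>(tileData[t], n);
  cudaStreamSynchronize(shared);
  cudaStreamDestroy(shared);

  // Pattern 2: one stream per tile, no ordering between tiles, which is
  // closer to what independent MPS client processes present to the GPU.
  std::vector<cudaStream_t> streams(numTiles);
  for (int t = 0; t < numTiles; ++t) {
    cudaStreamCreate(&streams[t]);
    tileKernel<<<grid, block, 0, streams[t]>>>(tileData[t], n);
  }
  for (auto &s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }

  for (auto &p : tileData) cudaFree(p);
  return 0;
}
```

Whether anything like the first pattern exists in your tiled runs I can’t say, but it is the kind of thing that shows up readily in a profiler timeline.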

MPS was created to solve work delivery problems, with an eye towards getting “closer” to the ideal work delivery. To posit that MPS could simply do better than what is achievable from a single process in the ideal case doesn’t make sense to me, but I don’t know everything, so I will simply say “it’s possible, I suppose”. It is certainly the case that, given certain kinds of work delivery breakdown (multiple processes issuing work, any one of which is too small to fully utilize the GPU), MPS does better than non-MPS (that is the canonical case it was designed for). It may be that your case straddles both of these ideas.

Even leaving all that aside, the “ideal” work delivery can be difficult to achieve in practice, or might result in code that is brittle or difficult to maintain or adapt. It’s a general programming principle to be modular; don’t put everything into a single function call. With that mindset, MPS is a relevant tool.

Yes, without a further example, that would be my expectation. MPS work is, or should be, “independent” between processes. Whether you issue the work sequences via MPS, or issue the same work fully asynchronously across streams in a single process, I would not expect MPS to do significantly better.
