I was given pause by what I discovered on a rainy afternoon two days ago. I have spent the past three years developing a code base for running many MD simulations on a card at one time. I still think there is merit in what I am doing: I have devised new methods that both expedite the integration of many systems in a single runtime instance and improve the way each individual system runs. But this has changed the way I think about compute-bound versus memory-bound processes: if there is one of each, interleaving them is absolutely possible, and it will even happen automagically if you just throw enough processes onto the card at one time, even with something as agnostic to the nature of each process as MPS.
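To make that concrete, here is a minimal sketch of the kind of work MPS ends up interleaving. This is not my MD code, just two toy kernels standing in for the two extremes: one that is nearly pure arithmetic and one that is nearly pure memory traffic. The binary name and the loop counts are arbitrary; the point is only that two such processes, launched as separate MPS clients, leave the device free to interleave their blocks on the same SMs.

```cuda
// Toy stand-ins (not the MD code itself): one compute-bound kernel and one
// memory-bound kernel. Run the same binary twice as separate MPS client
// processes ("./interleave compute" in one shell, "./interleave memory" in
// another) and the device can interleave their blocks on the same SMs.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Compute-bound: thousands of FMAs per element, negligible memory traffic.
__global__ void computeBound(float *out, int n, int iters) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float x = 1.0f + 1.0e-6f * i;
  for (int k = 0; k < iters; k++) {
    x = fmaf(x, 1.000001f, 0.5f);
  }
  out[i] = x;
}

// Memory-bound: a streaming copy, essentially no arithmetic per byte moved.
__global__ void memoryBound(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i];
  }
}

int main(int argc, char **argv) {
  const int n = 1 << 24;
  const bool compute = (argc > 1 && strcmp(argv[1], "compute") == 0);
  float *a, *b;
  cudaMalloc(&a, n * sizeof(float));
  cudaMalloc(&b, n * sizeof(float));
  cudaMemset(a, 0, n * sizeof(float));

  const int block = 256;
  const int grid = (n + block - 1) / block;
  for (int rep = 0; rep < 1000; rep++) {
    if (compute) {
      computeBound<<<grid, block>>>(b, n, 4096);
    } else {
      memoryBound<<<grid, block>>>(a, b, n);
    }
  }
  cudaDeviceSynchronize();
  printf("Finished 1000 %s-bound launches\n", compute ? "compute" : "memory");

  cudaFree(a);
  cudaFree(b);
  return 0;
}
```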
Another way to put it is that, except in the limiting case of raw performance on a single problem, making something go faster by hiding memory latency behind intense computation is pollution: it occupies arithmetic units that could otherwise be used by a process running in parallel, and it throws off heat that limits what other processes can do, or even what the card can sustain in future kernel launches.
It makes me wonder: since I will have explicit control over which systems are running at any one time, should I expect to be able to use CUDA streams to interleave memory-bound and compute-bound work to the same effect as MPS?
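For comparison, here is a sketch of the stream-based version of the same toy experiment, issued from a single process. This is only an illustration of the approach I am asking about, not a claim that it behaves identically to MPS: the grid-stride restructuring and the deliberately capped grid size (128 blocks here, an arbitrary choice) are my own, made so that neither kernel monopolizes every SM slot by itself, and whether the two kernels actually share the SMs still depends on occupancy and each kernel's resource footprint.

```cuda
// Single-process version: the same two toy kernels, rewritten with
// grid-stride loops and issued into separate CUDA streams so they are
// at least *allowed* to run concurrently.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void computeBound(float *out, int n, int iters) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    float x = 1.0f + 1.0e-6f * i;
    for (int k = 0; k < iters; k++) {
      x = fmaf(x, 1.000001f, 0.5f);
    }
    out[i] = x;
  }
}

__global__ void memoryBound(const float *in, float *out, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    out[i] = in[i];
  }
}

int main() {
  const int n = 1 << 24;
  float *a, *b, *c;
  cudaMalloc(&a, n * sizeof(float));
  cudaMalloc(&b, n * sizeof(float));
  cudaMalloc(&c, n * sizeof(float));
  cudaMemset(a, 0, n * sizeof(float));

  // Kernels in different streams may run concurrently; the hardware
  // decides whether they actually co-reside on the SMs.
  cudaStream_t sCompute, sMemory;
  cudaStreamCreate(&sCompute);
  cudaStreamCreate(&sMemory);

  // Deliberately undersized grids (the grid-stride loops pick up the
  // slack) so the two kernels can fit on the device at the same time.
  const int block = 256;
  const int grid = 128;
  for (int rep = 0; rep < 1000; rep++) {
    computeBound<<<grid, block, 0, sCompute>>>(b, n, 4096);
    memoryBound<<<grid, block, 0, sMemory>>>(a, c, n);
  }
  cudaDeviceSynchronize();
  printf("Finished interleaved stream launches\n");

  cudaStreamDestroy(sCompute);
  cudaStreamDestroy(sMemory);
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return 0;
}
```

The design question baked into this sketch is the one I am asking: if the grids are sized so that the kernels can co-reside, do streams recover the same block-level interleaving that MPS gives me across processes, or does MPS still have an edge?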
For reference, here is one of my data sets. What I did was create a single MD system of 23,558 atoms (an update of the dihydrofolate reductase “DHFR” benchmark well known to molecular dynamics developers). I then ran multiple simulations through multiple processes under MPS with three popular MD codes. Those results are marked with filled circles (or, cream-filled donuts, if your Monday needs to slow down). I also tiled the system into single simulations of 2x = 47,116 atoms, 3x = 70,674 atoms, and so on up to 15x = 353,370 atoms (and even further, but that runs into other limits in each code base which are outside the scope of this discussion), and ran those as a single process on each card. (Those are the square-decorated lines, shown in relief.)

As you can see, MPS outperforms tiling the system and running “multiple” copies of the protein simulation that way almost every time, sometimes by a wide margin. This is pretty easy to rationalize: when there are multiple simulations, the compute-bound and memory-bound processes can interleave, and the MPS scheduler seems to be doing a very good job of this.