Hello, I have a question about interference between processes when using MPS. First, I ran experiments to measure interference: I started by running a single process and checking the latency of DenseNet121, increasing the MPS percentage by 10 for each test (10, 20, …, 90). After that, I launched 2 processes with the MPS percentages paired as (10, 90), (20, 80), …, (50, 50) and again checked the duration of DenseNet121.
It turned out that the process running with the higher MPS percentage sees lower interference, while the process running with the lower percentage sees higher interference. I think it has something to do with the lower-percentage process getting fewer memory resources, but I'm not sure. Can you please explain why this kind of phenomenon happens? Below is part of the result of my experiment.
Duration of the model in a single process with 10% MPS: 0.0576s
Duration of the model in a single process with 90% MPS: 0.0141s
Duration of the model in the 10% MPS process with a 90% MPS process running together: 0.0851s
Duration of the model in the 90% MPS process with a 10% MPS process running together: 0.0151s
Duration of the model in a single process with 20% MPS: 0.0308s
Duration of the model in a single process with 80% MPS: 0.0143s
Duration of the model in the 20% MPS process with an 80% MPS process running together: 0.0465s
Duration of the model in the 80% MPS process with a 20% MPS process running together: 0.0163s
So the 10% and 20% MPS processes see interference of about 47% and 50%, while the 80% and 90% MPS processes see interference of only about 13% and 7%.
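For context, the kind of measurement I mean is roughly the following (a simplified PyTorch sketch, not my exact script; the model setup, input size, iteration counts, and the way CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is exported are illustrative):

```python
# Sketch of one measurement process. Run it with the MPS daemon active and the
# percentage exported in the environment, e.g.
#   CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=10 python measure.py
# (model, input size, and iteration counts here are illustrative)
import time
import torch
import torchvision

model = torchvision.models.densenet121().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):         # timed iterations
        model(x)
    torch.cuda.synchronize()
    duration = (time.perf_counter() - start) / 100

print(f"mean forward-pass duration: {duration:.4f}s")

# Interference is then the relative slowdown versus the solo run, e.g.
# (0.0851 - 0.0576) / 0.0576 ≈ 0.48 for the 10% process in the (10, 90) pair.
```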
Thank you in advance!
First, MPS, even with resource percentage provisioning, doesn’t guarantee that there will be no interference between clients. The MPS percentage partitioning divides the execution (SM) resources (AFAIK) but nothing else. In particular, there is no division of memory bandwidth (other than what may be inherent in the GPU design and the SM partitioning).
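For clarity on what that percentage controls: each client process picks up its own active-thread (SM) percentage when it connects to the MPS server, typically via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. A minimal launcher sketch (the measure.py script name is just a placeholder for your timing code):

```python
# Illustrative launcher: start two MPS client processes with different
# active-thread percentages. Assumes the MPS control daemon is already running
# and that "measure.py" is your timing script (placeholder name).
import os
import subprocess

def launch(percentage: int) -> subprocess.Popen:
    env = dict(os.environ)
    # Each MPS client reads this variable when it creates its CUDA context;
    # it limits the fraction of SMs that client's kernels may occupy.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percentage)
    return subprocess.Popen(["python", "measure.py"], env=env)

clients = [launch(10), launch(90)]   # e.g. the (10, 90) pairing from the test
for client in clients:
    client.wait()
```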
What I see in your output is that the model running by itself with 90% of the GPU (0.0141s) seems to run a bit faster than the same model running at 90% with another model running at 10% alongside it (0.0151s). That doesn’t seem surprising at all. The model running by itself at 90% has no competition for memory bandwidth. OTOH, when another client is present, there is competition for memory bandwidth, which might mean the model running at 90% takes a bit longer.
Memory bandwidth probably isn’t the only resource that isn’t expressly partitioned, but it is likely the most important one when thinking about these things.
Beyond that, I probably won’t be able to explain the exact relationship between percentage partitioning and the level of interference. Since we don’t have an exact description (that I know of) of how SMs utilize the available memory bandwidth, we can’t draw firm conclusions from it. However, if we assume that a single SM may be able to generate more memory traffic than you would predict from (available bandwidth)/(SM count) on the GPU in question, then it stands to reason that a process running on a “small” partition might be able to consume a lot of memory bandwidth by itself, but is more heavily impacted when there is competition. That is just hand-waving, though; I can’t offer a precise explanation.
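To make that hand-waving slightly more concrete with purely invented numbers (not measurements of any particular GPU):

```python
# Toy illustration of the argument above, with made-up numbers.
total_bw_gbs = 900.0   # hypothetical total DRAM bandwidth, GB/s
num_sms = 80           # hypothetical SM count
per_sm_share = total_bw_gbs / num_sms        # ~11 GB/s "fair share" per SM

# If a single SM can in practice generate, say, 3x its fair share of traffic,
# a 10% partition (8 SMs) could by itself consume far more than 10% of the
# total bandwidth while running alone:
small_partition_demand = 0.10 * num_sms * 3 * per_sm_share   # ~270 GB/s

# When a 90% partition then shows up and saturates DRAM, the small partition
# loses a large fraction of the bandwidth it was actually using, while the
# large partition was already close to bandwidth-limited and changes
# comparatively little.
print(f"per-SM share: {per_sm_share:.1f} GB/s, "
      f"10% partition solo demand: {small_partition_demand:.0f} GB/s")
```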
MIG, on the other hand, tries to do a better job of full GPU partitioning, so that clients running on separate MIG partitions are (mostly) performance-isolated from each other. Not all GPUs support MIG, however.