I noticed in the MPS document that one of the benefits of MPS Execution Resource Provisioning is to enhance the quality of service.
Improve QoS: The provisioning mechanism can be used as a classic QoS mechanism to limit available compute bandwidth. Reducing the portion of available threads will also concentrate the work submitted by a client to a set of SMs, reducing destructive interference with other clients’ submitted work.
Could you provide a more precise explanation of which specific aspects of QoS are being referred to here? It would also be enlightening to understand how Execution Resource Provisioning can enhance QoS.
Based on my understanding of QoS, I’ll outline how I set up the experimental verification of Execution Resource Provisioning to improve QoS. Firstly, I’d like to point out that our application scenario is the pipeline of autonomous driving, where the running models and their timing are relatively fixed. In the experiment, we have selected three models: Resnet, Mobilenet V2, and MNIST, which run in different processes. To conduct the test, we start inference for all three models at the same time and record their inference time until all models complete inference. This process constitutes one round of testing, and we repeat this multiple times to obtain the average inference time of the three models.
Next, we adjust the SM resource configuration of each process using Execution Resource Provisioning and conduct the above-mentioned tests again. We record the inference time of the three models in each round and calculate the fluctuations under different resource configurations. If the fluctuations are larger, it indicates that the QoS is poor. Conversely, if the fluctuations are smaller, it means that the QoS is good.
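The round-based harness described above could be sketched as follows. This is a minimal sketch, not the actual test code: the `worker` body just sleeps where the real inference call (e.g. `model.run(input)`) would go, and the model names are placeholders.

```python
import multiprocessing as mp
import time

def worker(name, start_evt, done_q):
    """One process per model: wait for the round-start signal,
    run one (placeholder) inference, report its latency."""
    start_evt.wait()
    t0 = time.perf_counter()
    time.sleep(0.01)  # placeholder for the real inference call
    done_q.put((name, time.perf_counter() - t0))

def run_round(models):
    """Start inference for all models simultaneously and collect
    the per-model inference time for this round."""
    start_evt = mp.Event()
    done_q = mp.Queue()
    procs = [mp.Process(target=worker, args=(m, start_evt, done_q))
             for m in models]
    for p in procs:
        p.start()
    start_evt.set()  # release all workers at once
    results = dict(done_q.get() for _ in procs)
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    # Repeat the round several times to average the inference times later.
    for rnd in range(3):
        print(rnd, run_round(["resnet", "mobilenet_v2", "mnist"]))
```

One round ends only when all three workers have reported, matching the setup where a round completes once every model finishes its inference.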
May I ask whether the above test setup is sufficient to effectively demonstrate the impact of Execution Resource Provisioning on QoS? If it is not adequate, would you kindly suggest a better experimental setup? Thank you in advance.
Let’s consider the case where we have 2 clients, and we restrict client 1 to using 70% of the GPU compute resources.
QoS in this context is referring to latency of response (i.e. providing an upper limit to observed latency) and throughput (i.e. providing a lower limit to observed throughput). Without the restriction of resources for client 1, the GPU could be (fully) busy processing client 1 work when a work request from a client 2 arrives. That work request will wait until resources become available. This could increase the latency for client 2, as compared to the case of a client 1 restriction via MPS.
A similar statement could be made about throughput. Without a restriction on client 1, the expected throughput for client 2 will depend on what else the GPU is doing and could be arbitrarily low. With a restriction, the minimum expected throughput for client 2 should correspond to the size of the “reservation” - i.e. the amount of resources not included in the restriction.
In my view, a key purpose of resource provisioning is to provide a guarantee to one client, in a multi-client scenario, that that client will not get “starved”. You can certainly extend this idea to more than one client, but it’s not obvious to me how it could be extended to all clients, when considering overall performance. Since each control is in the form of a restriction, I don’t know what to expect if you restrict client 1 to 70% and restrict client 2 to 70%. I’m not sure what that buys you, and I’m not sure how to make predictions of performance or set expectations in that case. (OK. In that exact case, you can expect that neither client will get less than 30% of the GPU. But if you extend that to 3 or more clients, then you can’t advance the idea of prevention of starvation.) Of course if you restrict client 1 to 50% and client 2 to 50%, then it’s easier to make predictions and set expectations. However, provisioning every client cannot improve average throughput globally, I don’t think. Provisioning every client can prevent a single client from “hogging” the GPU.
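The starvation arithmetic above can be made concrete with a small sketch: a client’s worst-case share is whatever the caps on all the other clients leave over, assuming every other client fully uses its cap. The function name and dict layout are illustrative, not any real API.

```python
def guaranteed_share(caps, client):
    """Worst-case SM share (in %) left for `client` when every other
    client is capped and fully uses its cap. 0 means no guarantee."""
    others = sum(c for name, c in caps.items() if name != client)
    return max(0, 100 - others)

# Two clients capped at 70% each: each is still guaranteed 30%.
assert guaranteed_share({"c1": 70, "c2": 70}, "c1") == 30
# Three clients at 70%: the caps overlap, so no starvation guarantee.
assert guaranteed_share({"c1": 70, "c2": 70, "c3": 70}, "c1") == 0
# A 50%/50% split: each client is guaranteed its full 50%.
assert guaranteed_share({"c1": 50, "c2": 50}, "c2") == 50
```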
An experimental setup to demonstrate the things I discussed would be to have one client issue “endless” work. That is, issue a kernel over and over again. This kernel should fully occupy the GPU. Then have a second client issue an “occasional” kernel, and measure the latency to completion of each kernel issued by the second client. The worst-case latency measured here should be better in the case of resource provisioning than without it (with a test designed like this.)
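One way to set up such a test is to cap only the “endless” client via the per-client environment variable `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE`, which is documented in the MPS guide. Below is a hedged launcher sketch; `./endless` and `./occasional` are hypothetical placeholder binaries for the saturating client and the latency-probing client, not real programs.

```python
import os
import subprocess

def client_env(sm_percent=None):
    """Build the environment for one MPS client. The per-client SM cap is
    set via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE (see the MPS documentation)."""
    env = dict(os.environ)
    if sm_percent is not None:
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percent)
    return env

def launch_client(cmd, sm_percent=None):
    """Launch one client process, optionally capped to a share of the SMs."""
    return subprocess.Popen(cmd, env=client_env(sm_percent))
```

In the experiment one would launch the saturating client capped, e.g. `launch_client(["./endless"], sm_percent=70)`, launch the probing client uncapped so it can time its own kernels, and then compare the probe’s worst-case latency with and without the cap on the endless client.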
I apologize for the imprecise description of the experiment. Allow me to provide a clearer and more accurate explanation.
The experiment consists of the following basic setup:
Three models, A, B, and C, have been selected to run in different processes
Inference is started for each model simultaneously, and the respective inference times are recorded
The round ends once all three models have completed inference, and the process is repeated multiple times
The measured inference times for each model are used to calculate fluctuations, using statistical measures such as variance, standard deviation, and percentiles
In order to verify the impact of Resource Provisioning on QoS, we configured the SM resources of each model as follows:
Model A: 100%, Model B: 100%, Model C: 100%
Model A: 50%, Model B: 50%, Model C: 50%
Model A: 20%, Model B: 30%, Model C: 50%
We expect to witness the following outcomes:
The model’s inference time is most susceptible to fluctuation under the first configuration
The second configuration follows with less fluctuation
The third configuration experiences the least amount of fluctuation in inference time.
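The fluctuation calculation in the setup above could look like the following, using only the standard library. The nearest-rank percentile used here is one of several common definitions, and the function name is illustrative.

```python
import statistics

def fluctuation_stats(times):
    """Summarize inference-time fluctuation for one model across rounds:
    mean, standard deviation, coefficient of variation, 95th percentile."""
    mean = statistics.mean(times)
    stdev = statistics.stdev(times)
    ordered = sorted(times)
    # Nearest-rank 95th percentile over the recorded rounds.
    idx = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return {"mean": mean, "stdev": stdev,
            "cov": stdev / mean, "p95": ordered[idx]}
```

A larger coefficient of variation (`cov`) under a given configuration would indicate worse QoS in the sense used in this thread.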
An experimental setup to demonstrate the things I discussed would be to have one client issue “endless” work.
During the experiment, I made sure that one of the models did not run in an “endless” loop. This was because the timing between the models is relatively fixed in the autonomous driving scenario. If one model runs in an “endless” loop, it can impact the timing between the other models and affect the overall quality of service. Our aim was to test the impact of resource provisioning on QoS in the context of autonomous driving.
I hope my explanation of the experiment was clear. I believe that this experiment could potentially validate the QoS that you described in your response. What is your opinion on this matter?
I spawned three processes from the main process, which allows me to keep them simultaneous by sending signals to them. I did not measure the variation introduced by that signaling itself.
Certainly, the child processes will send a signal to the main process once they have completed the inference. This allows the main process to issue signals for the subsequent cycle.
I guess the end goal here is to arrange the work so that all 3 inferencing requests complete at approximately the same time, with minimum unused resource. Your case 3 seems like a reasonable attempt to do that.
I’d like to point something out, keeping in mind that the GPU is a throughput-oriented machine. If each of your 3 inferencing requests fully occupies the GPU, then I’m not sure there is much reason to believe that this “careful partitioning” will result in anything better. MPS is still beneficial because it allows for simultaneous execution without context switching, but launching these 3 inference requests sequentially, from the same process (let’s say in separate streams), might be the best solution. Chopping up the machine requires considerable care, and could hinder your goal if any of the specifics change.
Sure, chopping up the machine means you can arrange for each request to finish at approximately the same time, but I’m not sure what benefit that is, if that time is actually later than it would have been by issuing the 3 inference requests in sequence.
Hi Robert, I have a follow-up question. I conducted the test that was mentioned previously and used the Coefficient of Variation of the inference time as the metric to evaluate the quality of service. The test results are shown below.
googlenet: 100, mobilenet: 100, resnet: 100
googlenet: 40, mobilenet: 60, resnet: 70
googlenet: 50, mobilenet: 50, resnet: 50
googlenet: 20, mobilenet: 30, resnet: 50
Subsequently, I conducted a test that involved running the models individually with MPS enabled. The test results are shown below.
From the data presented in the tables above, we can observe several patterns.
Pattern 1: When resources are provisioned in a more fine-grained manner, there is less overlap of SM resources among models, resulting in better QoS.
Pattern 2: Fewer available SM resources correspond to better QoS.
Pattern 3: Even when the same amount of SM resources is available, QoS deteriorates when multiple processes run simultaneously as compared to when a single process runs alone.
I believe Pattern 1 is precisely the outcome we hoped to achieve, as it demonstrates the potential benefits of resource provision, as outlined in the MPS manual, for improving QoS.
However, Pattern 2 is confusing me a lot. Do you have a reasonable explanation for it?
Regarding Pattern 3, I must admit it also confuses me. According to the first test, where the SM resource allocation was set to googlenet:20, mobilenet:30, resnet:50, each model was already provided with independent compute resources. Therefore, in theory, their Quality of Service (QoS) should have been similar to when they were run independently with the same SM resource allocation. However, the outcome showed that the QoS for each model suffered significantly when all three processes were running simultaneously in comparison to when they were run independently. The results can be compared as follows.
Model (SM resources) | Run simultaneously (%) | Run independently (%)
I was wondering if there are any other factors that might affect the QoS of the model apart from the SM resources. Could the thread block scheduler be one such factor? I might be mistaken, but my understanding is that a single GPU has only one thread block scheduler. When several models run concurrently, they have to share this scheduler, which might result in longer inference times as they wait for their turn.
If you have any further insights, I would greatly appreciate it if you could share with me. Thank you in advance.
In the case of your first table, that doesn’t surprise me too much. I already stated that when “provisioning all clients” I didn’t know what to expect unless you provide a separate “lane” for each client. Only your last case does that. It should make things most predictable, which I would assume reduces variation.
For pattern 3 (introducing more clients increases variability) this does not surprise me. The “partitioning” provided by MPS is not complete. Various internal resources are probably not partitioned. (On GPUs that support it, MIG can probably do a better job of partitioning/variability/disturbance isolation/QoS) For example main memory bandwidth and/or L2 cache bandwidth are probably not perfectly partitioned. I won’t be able to cite documentation for this, nor will I be able to give an exhaustive specification of what is or isn’t partitioned. Refer to the MPS docs for whatever may be published. I’m simply stating that if not all resources are partitioned, then additional activity on the GPU will perturb things.
I’ll state it again. Predictability does not necessarily imply highest throughput, or even shortest time to finish the work.
Thank you for your reply. I now have a clear understanding of Pattern 3.
Regarding Pattern 2, what I meant was the data presented in the second table, where all models are run independently. It appears that the QoS of the googlenet model improves as its SM resources decrease, and the same is observed for the other two models.
I can’t explain Pattern 2 without a test case and a lot more study. And I probably wouldn’t invest that time, so I’m not asking for such.
A GPU is a collection of execution resources that are dynamically provided to various “clients” or “workers” within the GPU, e.g. threads and threadblocks. As you restrict the size of the lane that code is allowed to operate in, you are reducing the number of clients/workers/threads/threadblocks that can be running at any given moment. So you are reducing the amount of dynamic variability or interaction that may occur. Furthermore, some things are not reduced as you reduce the size of that lane. For example, you have reduced the SMs that a code may operate on, but you have not reduced the memory bandwidth that the code may use, nor the L2 bandwidth (except as such reduction may come about implicitly due to GPU architecture, e.g. SM connections, crossbar connections, etc.). Therefore you are reducing the size of the worker complement, but giving that worker complement a “less disturbed” environment to work in. That seems to me like it may reduce variability.
Hand waving. But I don’t wish to delve into the details, either.
The elephant in the room is that for table 2, the execution duration is certainly increased. So, as I have stated half a dozen times now, this “QoS improvement” ought not to be considered in a vacuum.