Improving MPS performance using Volta MPS Execution Resource Provisioning

Hi, I am a CUDA developer working on improving the performance of a CUDA program using MPS.

I am using the Volta MPS Execution Resource Provisioning method to improve the performance of CUDA MPS (reference: https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_5_2).
I am working in this environment:

  • Tesla V100-SXM2
  • VRAM: 16 GB
  • CUDA version: 10.0
  • Driver version: 410.104

I tried this method in several ways; however, it didn't work.
First, I started the MPS server using this script:
#!/bin/bash

# The following must be performed with root privilege

export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
taskset -c 0 nvidia-cuda-mps-control -d

Then I ran 10 clients simultaneously (roughly as in the sketch below).
For Volta MPS Execution Resource Provisioning, I set CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to 10.
However, I couldn't observe any improvement. The result was the same even when I set the value to 20 using a different formula.
Even when I set the value to 0, the output was almost the same, although I thought the program should not be able to run at all with that value set to 0.
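
Roughly, each client was launched like this (a minimal sketch only; ./my_mps_client is a placeholder for my actual CUDA program):

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
# Limit each MPS client to ~10% of the GPU's execution resources
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=10

# Launch 10 clients concurrently and wait for all of them to finish
for i in $(seq 1 10); do
    ./my_mps_client &
done
wait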

How can this happen?
Is there any mistake I made, or should I change my procedure for this method?

Sincerely,
Tae Young Yeon.

I wouldn't assume that dividing resources to 10% per client is always beneficial or will always show an improvement. Just because you didn't see a benefit does not mean that something is wrong.

For example, if the work you are issuing per client is sufficient to saturate the GPU, then it is unlikely that MPS (with or without any provisioning) would provide any benefit, as compared to issuing that work sequentially, from one client.

I would go a bit further and assume that the use of MPS with a single client might reduce performance somewhat in that scenario, as the flexibility MPS provides for sharing a GPU between multiple clients presumably creates some additional overhead.

It seems I did not state my point clearly.
I have already observed a performance improvement using MPS while running multiple clients of my own.
Now, I am trying to improve performance further using the Volta MPS Execution Resource Provisioning method.
So, let me restate my question: does the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable really work?
What does that variable mean?
I cannot fully understand it from the reference.

Thank you for your effort,
Tae Young Yeon

MPS generally may improve the performance of multiple clients using the same GPU, compared to multiple clients using the same GPU without MPS. However, if the work issued per client is large enough, the improvement may be so small as to be not measurable.

As far as I know, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE works, and if you find that it doesn’t that is presumably a bug.

If you have 10 clients, setting CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to any particular value may not create any noticeable difference. Previously you seemed to be judging whether or not it was “working” based on whether or not you saw any improvement; I don't think that is a mischaracterization of your previous statements.

Please read the relevant sections of the CUDA MPS doc. What it does is defined in section 4.2.5:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

and behavioral expectations are outlined in section 2.3.5.2. Note that there are two goals stated, and neither has to do directly with an improvement in performance (defining performance as wall clock execution time for a given quantity of work). In fact I think it should be plainly evident that if you arbitrarily constrain a particular process to using 10% of the GPU execution resources, as opposed to letting the GPU make its own scheduling decisions, the wallclock execution time might possibly become worse.
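
For reference, here is a minimal sketch of the two ways the doc describes applying the limit; the exact control-command name and when each setting takes effect should be confirmed against section 4.2.5 (./my_client is a placeholder):

# Option 1: set a default limit through the control daemon
# (applies to servers/clients started after the command is issued)
echo "set_default_active_thread_percentage 10" | nvidia-cuda-mps-control

# Option 2: constrain an individual client process via the environment variable
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=10 ./my_client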

If you think that something is not working, I suggest providing a careful definition of what you believe “working” and “not working” mean, and provide a complete test case that demonstrates your observation. A complete test case means a complete code that someone else could run, and see your observation.
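
As an illustration of what such a test case might look like, a simple timing harness along these lines (with ./my_mps_client as a placeholder for your actual code, and the percentage passed as an argument) would let someone else reproduce the measurement:

#!/bin/bash
# Rough timing harness: run 10 concurrent clients and report total wall clock time
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=${1:-100}

start=$(date +%s)
for i in $(seq 1 10); do
    ./my_mps_client &
done
wait
end=$(date +%s)
echo "percentage=$CUDA_MPS_ACTIVE_THREAD_PERCENTAGE elapsed=$((end - start)) s"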

Of course you’re welcome to do as you wish. These are just suggestions.

Regarding your statements about, e.g., setting the limit to 0: that case is not allowed. The limit you set will have both a lower bound and a granularity applied to it. This is evident from the doc statement:

"The limit will be internally rounded up to the next hardware-supported thread
count limit. "
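
As a rough worked example (my own arithmetic, not a statement from the doc): a V100-SXM2 has 80 SMs, so a 10% limit corresponds to roughly 8 SMs, and a requested value of 0 is rounded up to the smallest partition the hardware supports, which is why your clients still ran:

# Illustrative arithmetic only: approximate SM count implied by a percentage limit on an 80-SM V100
PERCENT=10
TOTAL_SMS=80
echo "approx SMs per client: $(( TOTAL_SMS * PERCENT / 100 ))"   # prints 8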

Thank you for your detailed explanation. I now understand what I missed.
It was a lot of help.

Best,
Tae Young Yeon