General question on MPS set_active_thread_percentage

Hi there,

I read the MPS documentation and want to use set_active_thread_percentage to limit the GPU resources available to each running model. My setup is as follows: I run two inference models at the same time under the MPS server and want to measure their concurrent performance. I ran export set_active_thread_percentage=50 on the command line and then started the MPS service. After that, I ran python model1.py & python model2.py on the command line. I'd like to know whether set_active_thread_percentage applies to each inference model separately (meaning that each model can use up to 50% of the SMs, for example) or to the two models as a whole (meaning that the two models together can use up to 50% of the GPU's SMs). If it is the latter, how can I apply a 50% limit to each model?
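To make the setup concrete, here is a minimal sketch of launching the two scripts with a limit set on each client process. It assumes that the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE described in the MPS documentation is the right knob for a client process, and that the MPS control daemon is already running. The helper names client_env and launch are hypothetical and exist only for this illustration.

```python
import os
import subprocess
import sys


def client_env(percentage, base_env=None):
    """Return a copy of the environment with the MPS thread limit set.

    Assumption: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is read by each MPS
    client process when it creates its CUDA context.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percentage)
    return env


def launch(scripts, percentage=50):
    """Start each script as its own process with the limit in its environment."""
    return [
        subprocess.Popen([sys.executable, script], env=client_env(percentage))
        for script in scripts
    ]


# Example usage (requires model1.py and model2.py and a running MPS daemon):
# procs = launch(["model1.py", "model2.py"], percentage=50)
# for proc in procs:
#     proc.wait()
```

Setting the variable per process, rather than only once in the shell before starting the service, makes it explicit which limit each model is meant to run under.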

Thanks!
