MPS: Best practice for SM partitioning

In the 2021 MPS documentation there is a new section, 5.2 BEST PRACTICE FOR SM PARTITIONING. That section states that "Creating a context is a costly operation in terms of time, memory, and the hardware resources". However, with MPS only one CUDA context is created on the GPU, which is what allows kernels to execute in parallel. Consequently, I do not understand why one should care about contexts here. The only explanation I can think of is that you are referring to the Client CUDA Contexts. Can you please explain?
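For reference, this is roughly the pattern I understand section 5.2 to be describing: a small pool of contexts, each limited to a subset of SMs through execution affinity. It is only my own minimal sketch (device 0, two contexts, and placeholder SM counts), not the document's code:

```cpp
#include <cuda.h>

#define POOL_SIZE 2

int main(void) {
    CUdevice dev;
    CUcontext ctxPool[POOL_SIZE];

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // Create each context with an SM-count execution affinity, so work
    // launched in it runs on a subset of the device's SMs.
    for (int i = 0; i < POOL_SIZE; ++i) {
        CUexecAffinityParam affinity;
        affinity.type = CU_EXEC_AFFINITY_TYPE_SM_COUNT;
        affinity.param.smCount.val = 8 * (i + 1);   // placeholder values: 8 and 16 SMs
        cuCtxCreate_v3(&ctxPool[i], &affinity, 1, 0, dev);
    }

    // ... pick a context from the pool with cuCtxSetCurrent() and launch work ...

    for (int i = 0; i < POOL_SIZE; ++i)
        cuCtxDestroy(ctxPool[i]);
    return 0;
}
```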

Additionally, in the example where you create a pool of contexts, why do you use cudaLaunchCooperativeKernel? According to the documentation, cudaLaunchCooperativeKernel "Launches a device function where thread blocks can cooperate and synchronize as they execute". As a result, a plain cudaLaunchKernel should have the same behavior here, since the kernels do not cooperate. Is that correct?
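To make the question concrete, this is the kind of substitution I have in mind; the kernel and launch configuration are just placeholders of my own, since both launch functions take the same arguments:

```cpp
#include <cuda_runtime.h>

__global__ void partitionKernel(int *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;
}

int main(void) {
    int *d_out;
    cudaMalloc(&d_out, 4 * 32 * sizeof(int));
    void *args[] = { &d_out };

    // Cooperative launch, as in the documentation example I am asking about.
    cudaLaunchCooperativeKernel((void *)partitionKernel, dim3(4), dim3(32), args, 0, 0);

    // The substitution I mean: a plain launch of the same kernel,
    // which does not use any grid-wide cooperation.
    cudaLaunchKernel((void *)partitionKernel, dim3(4), dim3(32), args, 0, 0);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```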

Thank you in advance. Manos