Sorry for bothering you!
I am writing an application that needs to process messages of two different kinds on a single GPU (T4 or V100). The first kind is rare but computationally expensive; the second is common and cheap to process. I have tried binding each type of message to a separate CUDA stream, but that leads to interference.
Someone told me that I can use MPS to assign a portion of GPU resources to each type of message, which would avoid the interference. After reading the official MPS docs, I realized that it only supports a multi-process model (correct me if I am wrong), which is very different from the multi-thread model I am currently using. Refactoring the whole project is hardly possible due to the heavy engineering effort.
While reading the MPS docs, I found a code snippet showing that I can create CUDA contexts with different SM counts in one process and launch kernels on different contexts. In my case, I could use one context for each type of message. Since my whole project uses the CUDA runtime API and I am not familiar with the CUDA driver API, I am wondering whether this is a good way to solve my problem. Will it lead to bad performance since there are multiple contexts on one GPU? Is there a better way to restrict the resources available to a kernel?
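For reference, this is roughly what I understood from the snippet (adapted from memory, so the RARE_SM_COUNT / COMMON_SM_COUNT constants are just my own placeholders for the two message types, and error checking is omitted):

```cpp
#include <cuda.h>

// Placeholder SM budgets for the two message types (my own choice).
#define RARE_SM_COUNT   32
#define COMMON_SM_COUNT 8

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUexecAffinityParam affinity;
    affinity.type = CU_EXEC_AFFINITY_TYPE_SM_COUNT;

    // Context limited to RARE_SM_COUNT SMs for the rare/expensive messages.
    affinity.param.smCount.val = RARE_SM_COUNT;
    CUcontext rareCtx;
    cuCtxCreate_v3(&rareCtx, &affinity, 1, 0, dev);

    // Second context limited to COMMON_SM_COUNT SMs for the common/cheap messages.
    affinity.param.smCount.val = COMMON_SM_COUNT;
    CUcontext commonCtx;
    cuCtxCreate_v3(&commonCtx, &affinity, 1, 0, dev);

    // Kernels would then be launched after cuCtxSetCurrent(rareCtx) or
    // cuCtxSetCurrent(commonCtx), e.g. via cuLaunchKernel.

    cuCtxDestroy(commonCtx);
    cuCtxDestroy(rareCtx);
    return 0;
}
```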
Thank you so much!
Hello @user52911 and welcome to the NVIDIA developer forums.
I think this very CUDA-specific topic is better suited for the experts in this area, so I will take the liberty of moving it to the corresponding CUDA forum.
Thanks!
I think MPS is a good way to restrict GPU resources for a particular client. As you pointed out, it requires multi-process usage.
With the CUDA runtime API, and avoiding any use of the driver API, there is no practical method to make use of multiple contexts per process. The runtime expects to use the so-called “primary context”.
Within the runtime API, you can restrict a particular kernel call to only use a portion of the SMs: simply launch that kernel with no more blocks than the number of SMs you want it to occupy. For example, if you have a GTX 1660 Super with 22 SMs and you launch a kernel with 16 blocks, there will be (at least) 6 SMs that are unoccupied. If you launch another kernel, it will be able to use (at least) those 6 SMs, without interference.
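Here is a minimal sketch of that idea, assuming your T4 and two placeholder kernels standing in for the two message types (the 3:1 split is arbitrary; error checking omitted):

```cpp
#include <cuda_runtime.h>

// Placeholder kernels standing in for the two message types.
__global__ void rareExpensiveKernel() { /* ... */ }
__global__ void commonCheapKernel()   { /* ... */ }

int main() {
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    cudaStream_t rareStream, commonStream;
    cudaStreamCreate(&rareStream);
    cudaStreamCreate(&commonStream);

    // Give the expensive message type most of the SMs and leave the rest
    // for the cheap one. With no more blocks than SMs, the block scheduler
    // typically places at most one block per SM, so roughly
    // (smCount - rareBlocks) SMs remain available for the other kernel.
    int rareBlocks   = (smCount * 3) / 4;
    int commonBlocks = smCount - rareBlocks;

    rareExpensiveKernel<<<rareBlocks, 256, 0, rareStream>>>();
    commonCheapKernel<<<commonBlocks, 256, 0, commonStream>>>();

    cudaDeviceSynchronize();
    return 0;
}
```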
This is difficult to orchestrate in a complex setting where many kernels are being launched, but if you have just two things to partition, as you seem to be indicating, it might be something to consider. It is probably also less than optimal compared to MPS, because it may be difficult or impossible to reach full GPU occupancy this way, so there may be overall performance implications. But partitioning a system with MPS generally has performance implications as well.
Finally, CUDA stream priority may be something to consider.
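A sketch of that, for completeness (which message type should get the high-priority stream depends on which one is latency-sensitive in your application; error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Numerically lower values are higher priority, so greatestPriority
    // is the numerically smallest value in the range.
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPriority);

    printf("stream priority range: %d (lowest) .. %d (highest)\n",
           leastPriority, greatestPriority);

    // Launch the latency-sensitive message type's kernels into highPrio
    // and the other type's kernels into lowPrio.

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}
```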