I have a question about monitoring the number of SMs used by a CUDA kernel in a Multi-Process Service (MPS) environment.
If we run a single CUDA stream, the maximum number of SMs is static for that stream, so we can estimate the number of SMs used by each CUDA kernel.
Under an oversubscribed MPS environment, however, the maximum number of SMs per CUDA stream is dynamic.
This is because each CUDA stream sets its own SM count, and the total can exceed the device, as in the following inequality.
My case is, for example:
CUDA stream A (10) + B (20) + C (30) > GPU device (40)
i.e. A = 10, B = 20, C = 30, and the device has 40 SMs (the unit set via CU_EXEC_AFFINITY_TYPE_SM_COUNT).
Each CUDA stream runs multiple CUDA kernels, and the SM allocation differs per kernel.
For example, assume the SM allocations of stream C's kernels are as follows:
C-1: 30
C-2: 20
C-3: 10
When we try to run stream B in parallel, is there any method to monitor the number of available SMs?
For example, while C-1 is running, can we know that only 10 SMs remain?
I want to monitor the number of SMs allocated to each running CUDA kernel, because the processing resource consumed is the number of SMs multiplied by the time duration.
For reference, I set the SM limits with cuCtxCreate_v3/cuCtxCreate_v4 using CU_EXEC_AFFINITY_TYPE_SM_COUNT:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1gd84cbb0ad9470d66dc55e0830d56ef4d