I experimented with MPS on the hotspot CUDA application from the Rodinia-3.1 benchmark suite. A clone of it is maintained in my GitHub.
(with MPS enabled on the Linux system)
cuda/hotspot$ make
cuda/hotspot$ sudo nsys profile -o MPS-2-1 hotspot_launch.sh 2
The naming convention I used for the nsys output files is as follows: MPS-<no_of_invocations_of_hotspot>-<iteration number>
So MPS-2-1 above means hotspot was launched 2 times under MPS, first run; MPS-2-4 means hotspot was launched 2 times under MPS, fourth run. (The nsys report files are present in rodinia-3.1-hotspot/cuda/hotspot/MPS_NSYS_REPORTS.)
However, I noticed huge variability in the cudaMalloc times.
The naming convention I used for the nsys output files without MPS is as follows: NO-MPS-<no_of_invocations_of_hotspot>-<iteration number>
So NO-MPS-2-1 above means hotspot was launched 2 times without MPS (i.e., time-slicing), first run. (The nsys report files are present in rodinia-3.1-hotspot/cuda/hotspot/NO_MPS_NSYS_REPORTS.)
Now the cudaMalloc time appears to be about double what was reported with MPS enabled.
Possible reasons for variability in cudaMalloc duration:
The first cudaMalloc call in an application may also be the one that incurs CUDA initialization overhead.
cudaMalloc is a synchronizing operation: it cannot complete until all previously issued device activity in that context has finished. So, viewed from an API perspective, if there is ongoing device activity (e.g. kernel execution) when the cudaMalloc call is issued, the call will wait until the device is idle (see the sketch below this list).
In a multithreaded situation (and perhaps a multi-process MPS situation), the CUDA runtime may use locks to serialize certain activity. The acquisition cost of these locks will vary with the specifics of each instance.
There may be other reasons for this as well.
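To illustrate the synchronization point above, here is a minimal sketch of my own (not from the original post; kernel duration and allocation size are arbitrary) that times cudaMalloc once with the device idle and once while a long-running kernel is still in flight. The initial cudaFree(0) forces context creation up front, so the timed calls exclude the first-call initialization overhead mentioned in the first point.

// malloc_sync_sketch.cu -- build with: nvcc -o malloc_sync_sketch malloc_sync_sketch.cu
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Busy-wait kernel that keeps the GPU occupied for roughly `cycles` clock ticks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* keep the device busy */ }
}

static double ms_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    cudaFree(0);                         // force context creation up front
    void *p = nullptr;

    // Case 1: device idle when cudaMalloc is issued.
    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc(&p, 1 << 20);
    printf("cudaMalloc, idle device: %.3f ms\n", ms_since(t0));
    cudaFree(p);

    // Case 2: a long-running kernel is in flight when cudaMalloc is issued.
    // If cudaMalloc synchronizes as described above, this measurement should
    // include most of the kernel's runtime.
    spin<<<1, 1>>>(1000000000LL);        // on the order of a second, GPU dependent
    t0 = std::chrono::steady_clock::now();
    cudaMalloc(&p, 1 << 20);
    printf("cudaMalloc, busy device: %.3f ms\n", ms_since(t0));
    cudaFree(p);

    cudaDeviceSynchronize();
    return 0;
}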
For these reasons, the canonical advice is to perform all device memory allocations early in your application and keep them out of concurrent work-issuance loops/situations.
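As a minimal sketch of that pattern (buffer size, kernel, and iteration count are hypothetical, not taken from hotspot): allocate the device buffer once before the loop and reuse it, rather than calling cudaMalloc/cudaFree inside every iteration.

// alloc_early_sketch.cu -- build with: nvcc -o alloc_early_sketch alloc_early_sketch.cu
#include <cuda_runtime.h>

// Stand-in for real work on the buffer.
__global__ void work(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_buf = nullptr;

    // Do the (potentially expensive, synchronizing) allocation once, up front...
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    // ...then reuse the same buffer inside the work-issuance loop.
    for (int iter = 0; iter < 100; ++iter)
        work<<<(n + 255) / 256, 256>>>(d_buf, n);

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}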
It is typical that one observes a wide distribution of execution times for allocation requests to dynamic memory allocators. Most allocators, including the CUDA allocator, comprise multiple layers, and typically the cost of an allocation request depends on how deep the allocator has to reach into this hierarchy. This in turn depends on the size of an allocation request, attributes of the allocation request (if any), and the present internal state of the allocator (e.g. fragmentation).
In order to further comment on this specific case, one would need to know details of the cudaMalloc() calls associated with these profiles. The cluster of intense “OS runtime libraries” activity associated with each cudaMalloc() call shown suggests to me that these are large allocations causing the allocator to dip down to the deepest (most expensive) allocator layer. However, without knowing the internal details of the CUDA allocator (which, as far as I know, are not documented by NVIDIA and may well change over time), I am not particularly confident in this diagnosis.
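For anyone who wants to observe the spread described above on their own system, here is a small sketch of my own (sizes and repetition count are arbitrary) that times repeated cudaMalloc/cudaFree requests of a few different sizes and prints the minimum/average/maximum duration per size.

// malloc_spread_sketch.cu -- build with: nvcc -o malloc_spread_sketch malloc_spread_sketch.cu
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    cudaFree(0);   // force context creation so it does not pollute the first timing
    const size_t sizes[] = { 1 << 12, 1 << 20, 1 << 26, size_t(1) << 30 };

    for (size_t sz : sizes) {
        const int reps = 20;
        double min_ms = 1e30, max_ms = 0.0, sum_ms = 0.0;
        int ok = 0;
        for (int i = 0; i < reps; ++i) {
            void *p = nullptr;
            auto t0 = std::chrono::steady_clock::now();
            cudaError_t err = cudaMalloc(&p, sz);
            double ms = std::chrono::duration<double, std::milli>(
                            std::chrono::steady_clock::now() - t0).count();
            if (err != cudaSuccess)
                break;                    // e.g. the largest size may not fit
            cudaFree(p);
            if (ms < min_ms) min_ms = ms;
            if (ms > max_ms) max_ms = ms;
            sum_ms += ms;
            ++ok;
        }
        if (ok > 0)
            printf("size %10zu B: min %.3f  avg %.3f  max %.3f ms (%d calls)\n",
                   sz, min_ms, sum_ms / ok, max_ms, ok);
    }
    return 0;
}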