cudaMalloc calls from multiple threads on same GPU / multiple processes on different GPUs - serialized?

I have a partially GPU-accelerated algorithm with a few cudaMalloc (actually cudaMallocPitch) calls in it.
System is Windows 10, WDDM driver is used.

When invoking the algorithm from multiple CPU threads on the same GPU (in order to better utilize it), I do not get the expected speedup. I suspect it is due to the cudaMalloc calls.

I have tried to invoke the algorithm also in multiple processes (all using the same GPU) in order to better utilize the GPU, still I do not get the expected speedup.

Now my questions (for WDDM / TCC mode):
(A) Are cudaMalloc calls on the same GPU in multiple threads serialized?
(B) Are cudaMalloc calls on different GPUs in multiple threads serialized?
(C) Are cudaMalloc calls on the same GPU in multiple processes serialized?
(D) Are cudaMalloc calls on different GPUs in multiple processes serialized?

I assume that on Win10 64-bit, UVA is in effect.

I would expect so. (The one possible exception might be case D, if you are using CUDA_VISIBLE_DEVICES to make sure each process only has "their own" GPU in view. Not sure about that.) Not only do cudaMalloc operations serialize, they also synchronize other activity, as described in the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization
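If you want to confirm the serialization on your own system, one way is to time concurrent allocations. The sketch below (hypothetical, not from this thread) runs repeated cudaMallocPitch/cudaFree cycles from several host threads on the same device; if allocations serialize, N threads take roughly as long as one thread doing N times the work.

```cuda
#include <cstdio>
#include <chrono>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Each host thread performs 'iterations' allocate/free cycles on device 0.
static void allocLoop(int iterations)
{
    cudaSetDevice(0);                      // all threads target the same GPU
    for (int i = 0; i < iterations; ++i) {
        void  *ptr   = nullptr;
        size_t pitch = 0;
        cudaMallocPitch(&ptr, &pitch, 1024 * sizeof(float), 1024);
        cudaFree(ptr);
    }
}

int main()
{
    const int nThreads = 4, iterations = 100;

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t)
        threads.emplace_back(allocLoop, iterations);
    for (auto &t : threads)
        t.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();

    printf("%d threads x %d cudaMallocPitch calls: %lld ms\n",
           nThreads, iterations, (long long)ms);
    return 0;
}
```

Compare the wall-clock time against a single-threaded run with the same total number of calls; little or no difference indicates the allocations are serializing.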

The general advice is to avoid doing cudaMalloc during timing-sensitive operations. Allocate your buffers once, before you start the work, and then reuse them.
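A minimal sketch of that allocate-once pattern (assumed layout, not the poster's actual code): each worker thread gets its own pre-allocated pitched buffer and its own stream, created before the timed region, so no cudaMalloc appears in the hot path.

```cuda
#include <cuda_runtime.h>

struct WorkerBuffers {
    float       *d_data = nullptr;
    size_t       pitch  = 0;
    cudaStream_t stream = nullptr;
};

void setup(WorkerBuffers &wb, size_t widthElems, size_t height)
{
    // One-time cost, paid outside the performance-critical section.
    cudaMallocPitch((void **)&wb.d_data, &wb.pitch,
                    widthElems * sizeof(float), height);
    cudaStreamCreate(&wb.stream);
}

void runIteration(WorkerBuffers &wb, const float *h_src,
                  size_t widthElems, size_t height)
{
    // Reuse the same buffer every iteration; no cudaMalloc here,
    // hence no implicit device-wide synchronization per iteration.
    cudaMemcpy2DAsync(wb.d_data, wb.pitch,
                      h_src, widthElems * sizeof(float),
                      widthElems * sizeof(float), height,
                      cudaMemcpyHostToDevice, wb.stream);
    // ... kernel launches on wb.stream ...
    cudaStreamSynchronize(wb.stream);
}

void teardown(WorkerBuffers &wb)
{
    cudaStreamDestroy(wb.stream);
    cudaFree(wb.d_data);
}
```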

UVA forces all memory maps in the system to be harmonized (with the possible exception noted above). Therefore a change to the memory map in a device must be registered and acknowledged throughout the UVA regime. The CUDA driver has no idea who may use that allocation next.

It is Windows 10 64bit, newest drivers.

Thanks for the info. I suppose for TCC mode the answers are the same, or are they?

In fact, for the test of multiple processes on multiple GPUs (case D), for simplicity I use the CUDA_VISIBLE_DEVICES variable so that each application instance sees only one GPU. But as I wrote, it seems that the answer to (D) is still yes, even when CUDA_VISIBLE_DEVICES is set properly so that only one GPU is visible.

I think in TCC mode the negative effect of the 'global' serialization of GPU memory allocations is not as pronounced, because the allocations are about an order of magnitude faster (see https://devtalk.nvidia.com/default/topic/963440/cudamalloc-pitch-significantly-slower-on-windows-with-geforce-drivers-gt-350-12/).