cudaMalloc calls from multiple threads on same GPU / multiple processes on different GPUs - serialized?

I have a partially GPU-accelerated algorithm with a few cudaMalloc (actually cudaMallocPitch) calls in it.
System is Windows 10, WDDM driver is used.

When invoking the algorithm from multiple CPU threads on the same GPU (in order to better utilize it), I do not get the expected speedup. I suspect it is due to the cudaMalloc calls.

I have tried to invoke the algorithm also in multiple processes (all using the same GPU) in order to better utilize the GPU, still I do not get the expected speedup.

Now my questions (for WDDM / TCC mode):
(A) Are cudaMalloc calls on the same GPU in multiple threads serialized?
(B) Are cudaMalloc calls on different GPUs in multiple threads serialized?
(C) Are cudaMalloc calls on the same GPU in multiple processes serialized?
(D) Are cudaMalloc calls on different GPUs in multiple processes serialized?

I assume that on Win10 64-bit, UVA is in effect.

I would expect so. (The one possible exception might be case D, if you are using CUDA_VISIBLE_DEVICES to make sure each process only has "their own" GPU in view. Not sure about that.) Not only do cudaMalloc operations serialize, they also synchronize other activity, as described in the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization
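If you want to confirm the serialization on your own system, one way is to time concurrent allocations. The sketch below (hypothetical, not from this thread) runs repeated cudaMallocPitch/cudaFree cycles from several host threads on the same device; if allocations serialize, N threads take roughly as long as one thread doing N times the work.

```cuda
#include <cstdio>
#include <chrono>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Each host thread performs 'iterations' allocate/free cycles on device 0.
static void allocLoop(int iterations)
{
    cudaSetDevice(0);                      // all threads target the same GPU
    for (int i = 0; i < iterations; ++i) {
        void  *ptr   = nullptr;
        size_t pitch = 0;
        cudaMallocPitch(&ptr, &pitch, 1024 * sizeof(float), 1024);
        cudaFree(ptr);
    }
}

int main()
{
    const int nThreads = 4, iterations = 100;

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t)
        threads.emplace_back(allocLoop, iterations);
    for (auto &t : threads)
        t.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();

    printf("%d threads x %d cudaMallocPitch calls: %lld ms\n",
           nThreads, iterations, (long long)ms);
    return 0;
}
```

Compare the wall-clock time against a single-threaded run with the same total number of calls; little or no difference indicates the allocations are serializing.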

The general advice is to avoid doing cudaMalloc during timing-sensitive operations. Allocate your buffers once, before you start the work, and then reuse them.
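A minimal sketch of that allocate-once pattern (assumed layout, not the poster's actual code): each worker thread gets its own pre-allocated pitched buffer and its own stream, created before the timed region, so no cudaMalloc appears in the hot path.

```cuda
#include <cuda_runtime.h>

struct WorkerBuffers {
    float       *d_data = nullptr;
    size_t       pitch  = 0;
    cudaStream_t stream = nullptr;
};

void setup(WorkerBuffers &wb, size_t widthElems, size_t height)
{
    // One-time cost, paid outside the performance-critical section.
    cudaMallocPitch((void **)&wb.d_data, &wb.pitch,
                    widthElems * sizeof(float), height);
    cudaStreamCreate(&wb.stream);
}

void runIteration(WorkerBuffers &wb, const float *h_src,
                  size_t widthElems, size_t height)
{
    // Reuse the same buffer every iteration; no cudaMalloc here,
    // hence no implicit device-wide synchronization per iteration.
    cudaMemcpy2DAsync(wb.d_data, wb.pitch,
                      h_src, widthElems * sizeof(float),
                      widthElems * sizeof(float), height,
                      cudaMemcpyHostToDevice, wb.stream);
    // ... kernel launches on wb.stream ...
    cudaStreamSynchronize(wb.stream);
}

void teardown(WorkerBuffers &wb)
{
    cudaStreamDestroy(wb.stream);
    cudaFree(wb.d_data);
}
```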

UVA forces all memory maps in the system to be harmonized (with the possible exception noted above). Therefore a change to the memory map in a device must be registered and acknowledged throughout the UVA regime. The CUDA driver has no idea who may use that allocation next.

It is Windows 10 64bit, newest drivers.

Thanks for the info. I suppose for TCC mode the answers are the same, or are they?

In fact, for the test of multiple processes on multiple GPUs (case D), for simplicity I use the CUDA_VISIBLE_DEVICES variable so that each application instance sees only one GPU. But as I wrote, it seems that the answer to (D) is still yes, even when CUDA_VISIBLE_DEVICES is set properly so that only one GPU is visible.

I think in TCC mode the negative effect of the 'global' serialization of GPU memory allocations is not as pronounced, because the allocations are about an order of magnitude faster (see https://devtalk.nvidia.com/default/topic/963440/cudamalloc-pitch-significantly-slower-on-windows-with-geforce-drivers-gt-350-12/).