Implicit synchronization

Dude1205 · April 29, 2015, 6:23pm

Hi everyone,

I was not getting the expected concurrency when using multiple streams and realized the issue comes from a restriction described in section 3.2.5.5.4, Implicit Synchronization, of the CUDA Programming Guide:

"Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

a page-locked host memory allocation,
a device memory allocation,
a device memory set,
a memory copy between two addresses to the same device memory,
any CUDA command to the NULL stream,
a switch between the L1/shared memory configurations described in Compute Capability 2.x and Compute Capability 3.x."

The issue is that I allocate pinned memory and device memory between two sets of operations in different streams. The solution is quite simple; I just need to allocate memory beforehand. But that implies that you know in advance exactly how much memory you need or that you have some sort of complex memory management mechanism.

My question therefore is, is this restriction likely to be removed in a future release?
If so, would anyone else like to see a feature where memory allocations (and deallocations) could be streamed just like transfers and kernels? This way, memory could be used only where and when it is needed.

Robert_Crovella · April 29, 2015, 6:56pm

It’s not likely to be removed in a future release. Memory allocations modify the GPU virtual memory map. Updating the GPU virtual memory map must be done when kernels are not running, hence the need for a device sync.

Gregory_Diamos · April 29, 2015, 8:13pm

I actually agree with Dude1205 that this behavior is annoying and it would be great if NVIDIA fixed this. You don’t expect std::malloc/std::free on a CPU to block on other threads.

Note that you can work around this limitation pretty easily by implementing your own malloc/free. cudaMalloc/cudaFree perform heavy-weight synchronizations and update GPU virtual memory maps, so you don’t have to try very hard to write a faster malloc/free. Just rounding up allocations to a pool size, sticking them in a std::map, and splitting/merging them on malloc/free calls is significantly faster and avoids the synchronizations.
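A minimal sketch of the kind of allocator described above, written as host-only C++ so it can be read without a GPU. It hands out offsets within one region that is assumed to have been allocated once up front (in CUDA, by a single cudaMalloc that is not shown here). The class name PoolAllocator, its method names, the first-fit policy and the use of two std::map instances are illustrative assumptions, not details given in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical sub-allocator over one pre-allocated region. It returns
// offsets relative to the region base; the base pointer itself would come
// from a single up-front allocation, which this sketch does not perform.
class PoolAllocator {
public:
    static constexpr std::size_t kInvalid = static_cast<std::size_t>(-1);

    PoolAllocator(std::size_t capacity, std::size_t granularity)
        : granularity_(granularity) {
        assert(granularity > 0);
        free_[0] = capacity;  // one free block covering the whole region
    }

    // First-fit allocation. Returns an offset, or kInvalid if nothing fits.
    std::size_t allocate(std::size_t bytes) {
        if (bytes == 0) bytes = 1;
        const std::size_t size = roundUp(bytes);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < size) continue;
            const std::size_t offset = it->first;
            const std::size_t remainder = it->second - size;
            free_.erase(it);
            if (remainder > 0) free_[offset + size] = remainder;  // split
            used_[offset] = size;
            return offset;
        }
        return kInvalid;
    }

    // Returns a block to the free map and merges it with free neighbours.
    void release(std::size_t offset) {
        auto u = used_.find(offset);
        if (u == used_.end()) return;  // unknown offset: ignore
        std::size_t size = u->second;
        used_.erase(u);

        auto next = free_.lower_bound(offset);
        if (next != free_.end() && offset + size == next->first) {
            size += next->second;  // merge with the following free block
            next = free_.erase(next);
        }
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == offset) {
                prev->second += size;  // merge with the preceding free block
                return;
            }
        }
        free_[offset] = size;
    }

    std::size_t freeBlockCount() const { return free_.size(); }

private:
    std::size_t roundUp(std::size_t n) const {
        return (n + granularity_ - 1) / granularity_ * granularity_;
    }

    std::size_t granularity_;
    std::map<std::size_t, std::size_t> free_;  // offset -> size of free block
    std::map<std::size_t, std::size_t> used_;  // offset -> size of live block
};
```

Because allocate and release only touch host-side maps, neither call needs to go through the driver. A real version would add the offset to the base device pointer and would have to respect the alignment requirements of the data types stored in the pool.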

The downsides of doing this are that 1) applications will often use more GPU physical memory on average because free won’t immediately return allocations, 2) the memory checking tools will have a harder time detecting out-of-bounds accesses, and 3) since kernels are executed asynchronously, you need to make sure that calls to free are scheduled after kernels complete (a straightforward solution is to queue the free in the same stream).
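The ordering argument in point 3 can be illustrated with a host-only model. The sketch below assumes a stream behaves as a FIFO queue of work items; FakeStream and runDeferredFreeDemo are hypothetical names invented for this illustration and are not CUDA APIs. It only shows why a free queued behind a kernel in the same queue cannot run before that kernel.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stream: work items run in submission order.
class FakeStream {
public:
    void enqueue(std::function<void()> work) {
        assert(work && "work item must be callable");
        pending_.push(std::move(work));
    }

    // Runs everything queued so far, in FIFO order.
    void drain() {
        while (!pending_.empty()) {
            std::function<void()> work = std::move(pending_.front());
            pending_.pop();
            work();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Records the order of events so the ordering can be inspected.
inline std::vector<std::string> runDeferredFreeDemo() {
    std::vector<std::string> log;
    FakeStream stream;
    bool bufferLive = true;  // stands in for a block taken from a pool

    // The "kernel" reads the buffer state at the time it actually runs.
    stream.enqueue([&] {
        log.push_back(bufferLive ? "kernel: buffer live"
                                 : "kernel: buffer freed");
    });
    // The free is queued, not executed, so it cannot overtake the kernel.
    stream.enqueue([&] {
        bufferLive = false;
        log.push_back("free");
    });

    log.push_back("host: returned from enqueue");
    stream.drain();
    return log;
}
```

If the host had instead cleared bufferLive immediately after enqueueing the kernel, the kernel would have observed a freed buffer, which is the hazard point 3 warns about.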

tbenson · April 29, 2015, 8:18pm

Does “device memory set” include the cudaMemset*Async() functions? cudaMemset() uses the NULL stream, so I can see why that would force implicit synchronization, but why should cudaMemsetAsync() on a non-NULL stream force implicit synchronization?

Dude1205 · April 29, 2015, 8:46pm

@Gregory Diamos: Yes. It does not make sense to me that the GPU has to be completely idle in order to allocate new memory. I understand it might be simpler to implement virtual memory maps with this restriction. But in theory, lifting this restriction should be feasible.

Isn’t it already the case for device-side malloc anyway?

Gregory_Diamos · April 30, 2015, 12:52am

I think by definition device-side malloc does not require the GPU to be idle. It allocates out of a reserved fixed-size heap in global memory, so you just get a virtual address that has already been mapped. It’s interesting to note that you could perform memory allocation asynchronously by wrapping device-side malloc/free in kernels.

Robert_Crovella · April 30, 2015, 1:03am

Except that pointers returned by device-side malloc are not usable for transferring data to/from the host (e.g. via cudaMemcpy). That may or may not be important. Given the initial problem statement in this thread, it might be.
