I recently started migrating one of our CUDA-heavy codebases to the asynchronous functions instead of the synchronous ones, i.e. switching to cudaMallocAsync, cudaFreeAsync and cudaMemcpyAsync. One thing to note is that for cudaMemcpyAsync we still use pageable memory rather than pinned, but I did want the operation to be tied to a specific stream.
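For reference, the per-stream pattern after the migration looks roughly like this (a sketch; buffer names and sizes are illustrative, not from the real code):

```cpp
#include <cuda_runtime.h>
#include <vector>

void processOnStream(cudaStream_t stream, const std::vector<float>& hostData)
{
    float* devBuf = nullptr;
    const size_t bytes = hostData.size() * sizeof(float);

    // Allocation and free are stream-ordered and served from the device's default memory pool.
    cudaMallocAsync(&devBuf, bytes, stream);

    // The host buffer is ordinary pageable memory; the copy is still issued on the stream.
    cudaMemcpyAsync(devBuf, hostData.data(), bytes, cudaMemcpyHostToDevice, stream);

    // ... kernel launches on `stream` ...

    cudaFreeAsync(devBuf, stream);
}
```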
However, I have noticed some odd behavior: the reserved memory of the pool keeps growing when I use more than one stream. The usedHigh value peaks at a certain point, but the pool still keeps reserving new memory instead of re-using what it already holds. I'm not sure why this only happens with multiple streams. Below are some snapshots of the memory pool usage taken at intervals. With two streams, reserved just keeps growing even when the pool should already have enough memory to serve the allocations. With a single stream I see the expected behavior: reserved follows usedHigh, peaks a little above it, and from then on allocations are always served from the pool.
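The statistics come from the default memory pool, read roughly like this (device 0 assumed, error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

void printPoolStats()
{
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);   // device 0 assumed

    // The pool attributes are 64-bit counters.
    unsigned long long reservedCur = 0, reservedHigh = 0, usedCur = 0, usedHigh = 0;
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reservedCur);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemHigh,    &reservedHigh);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent,     &usedCur);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemHigh,        &usedHigh);

    std::printf("reserved: %llu (high %llu)  used: %llu (high %llu)\n",
                reservedCur, reservedHigh, usedCur, usedHigh);
}
```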
I tried disabling cudaMemPoolReuseAllowOpportunistic and cudaMemPoolReuseAllowInternalDependencies, yet the same strange behavior can be observed.
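In case it matters, this is roughly how I toggled them:

```cpp
#include <cuda_runtime.h>

void disablePoolReuse()
{
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);

    int disable = 0;   // the reuse policy attributes take an int flag
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic,        &disable);
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies, &disable);
    // cudaMemPoolReuseFollowEventDependencies was left at its default.
}
```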
Now when I add an explicit cudaStreamSynchronize, the memory is freed as expected. I am of course trying to avoid synchronization where it is not needed, and from what I understand, a cudaMemcpyAsync from pageable memory will block anyway.
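Concretely, the workaround that makes the reserved memory drop again looks like this, which is exactly the synchronization I would like to avoid:

```cpp
cudaFreeAsync(devBuf, stream);
cudaStreamSynchronize(stream);   // after this, the pool returns the memory to the OS
```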
Debugging with compute-sanitizer did not yield any success either, as I believe it performs the synchronization itself; in that case I can see the memory being returned to the OS.
I also tried creating some standalone applications that mimic the behavior, but with no success; a simplified sketch of one of them is below.
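The repro attempt alternates stream-ordered alloc/copy/free over two streams with a pageable host buffer and prints the pool statistics every so often (simplified; the real application does more work per iteration):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    std::vector<float> host(1 << 20, 1.0f);   // pageable host buffer
    const size_t bytes = host.size() * sizeof(float);

    for (int i = 0; i < 1000; ++i) {
        cudaStream_t s = streams[i % 2];   // alternate between the two streams

        float* dev = nullptr;
        cudaMallocAsync(&dev, bytes, s);
        cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, s);
        cudaFreeAsync(dev, s);

        if (i % 100 == 0) {
            unsigned long long reserved = 0, usedHigh = 0;
            cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);
            cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemHigh, &usedHigh);
            std::printf("iter %d: reserved %llu, usedHigh %llu\n", i, reserved, usedHigh);
        }
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    return 0;
}
```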
Any suggestions on how I could proceed further in finding out the cause of this issue when using multiple streams?
Windows 10 Enterprise
Version 22H2
NVIDIA-SMI 546.29 Driver Version: 546.29 CUDA Version: 12.3
NVIDIA GeForce RTX 3070