I recently started migrating one of our CUDA-heavy codebases to the asynchronous functions instead of the synchronous ones, i.e. switching to cudaMallocAsync, cudaFreeAsync and cudaMemcpyAsync. One thing to note is that for cudaMemcpyAsync we still use pageable memory rather than pinned, but I did want the operation to be tied to a specific stream.
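For reference, the per-stream pattern after the migration looks roughly like this (a sketch; buffer names and sizes are illustrative, not from the real code):

```cpp
#include <cuda_runtime.h>
#include <vector>

void processOnStream(cudaStream_t stream, const std::vector<float>& hostData)
{
    float* devBuf = nullptr;
    const size_t bytes = hostData.size() * sizeof(float);

    // Allocation and free are stream-ordered and served from the device's default memory pool.
    cudaMallocAsync(&devBuf, bytes, stream);

    // The host buffer is ordinary pageable memory; the copy is still issued on the stream.
    cudaMemcpyAsync(devBuf, hostData.data(), bytes, cudaMemcpyHostToDevice, stream);

    // ... kernel launches on `stream` ...

    cudaFreeAsync(devBuf, stream);
}
```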
However, I have noticed some odd behavior: the reserved memory of the pool keeps growing when I use more than one stream. The usedHigh value peaks at a certain point, but the pool still keeps reserving new memory instead of re-using what it already holds. I'm not sure why this only happens with multiple streams. Below are some snapshots of the memory pool usage taken at intervals. With two streams, reserved just keeps growing even when the pool should already have enough memory to serve the allocations. With a single stream I see the expected behavior: reserved follows usedHigh, peaks a little above it, and from then on allocations are always served from the pool.
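The statistics come from the default memory pool, read roughly like this (device 0 assumed, error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

void printPoolStats()
{
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);   // device 0 assumed

    // The pool attributes are 64-bit counters.
    unsigned long long reservedCur = 0, reservedHigh = 0, usedCur = 0, usedHigh = 0;
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reservedCur);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemHigh,    &reservedHigh);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent,     &usedCur);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemHigh,        &usedHigh);

    std::printf("reserved: %llu (high %llu)  used: %llu (high %llu)\n",
                reservedCur, reservedHigh, usedCur, usedHigh);
}
```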
I tried disabling cudaMemPoolReuseAllowOpportunistic and cudaMemPoolReuseAllowInternalDependencies, yet the same strange behavior can be observed.
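In case it matters, this is roughly how I toggled them:

```cpp
#include <cuda_runtime.h>

void disablePoolReuse()
{
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);

    int disable = 0;   // the reuse policy attributes take an int flag
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic,        &disable);
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies, &disable);
    // cudaMemPoolReuseFollowEventDependencies was left at its default.
}
```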
Now when I add an explicit cudaStreamSynchronize, the memory is freed as expected. I am of course trying to avoid synchronization where it is not needed, and from what I understand, a cudaMemcpyAsync from pageable memory will block anyway.
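Concretely, the workaround that makes the reserved memory drop again looks like this, which is exactly the synchronization I would like to avoid:

```cpp
cudaFreeAsync(devBuf, stream);
cudaStreamSynchronize(stream);   // after this, the pool returns the memory to the OS
```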
Debugging with compute-sanitizer did not yield any success either, as I believe it performs the synchronization itself; in that case I can see the memory being returned to the OS.
I also tried creating some standalone applications that mimic the behavior, but with no success; a simplified sketch of one of them is below.
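The repro attempt alternates stream-ordered alloc/copy/free over two streams with a pageable host buffer and prints the pool statistics every so often (simplified; the real application does more work per iteration):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    std::vector<float> host(1 << 20, 1.0f);   // pageable host buffer
    const size_t bytes = host.size() * sizeof(float);

    for (int i = 0; i < 1000; ++i) {
        cudaStream_t s = streams[i % 2];   // alternate between the two streams

        float* dev = nullptr;
        cudaMallocAsync(&dev, bytes, s);
        cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, s);
        cudaFreeAsync(dev, s);

        if (i % 100 == 0) {
            unsigned long long reserved = 0, usedHigh = 0;
            cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);
            cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemHigh, &usedHigh);
            std::printf("iter %d: reserved %llu, usedHigh %llu\n", i, reserved, usedHigh);
        }
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    return 0;
}
```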
Any suggestions on how I could proceed further in finding out the cause of this issue when using multiple streams?
Windows 10 Enterprise
Version 22H2
NVIDIA-SMI 546.29 Driver Version: 546.29 CUDA Version: 12.3
NVIDIA GeForce RTX 3070