Unexpected behavior with cudaMallocAsync memory reuse

Hello,

I just stumbled upon an issue with the stream-ordered asynchronous allocator (cudaMallocAsync): it does not reuse memory freed previously in the same stream.

Consider the following example program. It asynchronously allocates and frees 25 GiB of memory, and then asynchronously allocates and frees 30 GiB.

#include <cstdio>
#include <cstdlib> // for exit() and atoll()

#ifndef CHECK
#define CHECK(status) do { check((status), __FILE__, __LINE__); } while(false)
#endif
inline void check(cudaError_t error_code, const char *file, int line)
{
    if (error_code != cudaSuccess)
    {
        fprintf(stderr, "CUDA Error %d: %s. In file '%s' on line %d\n", error_code, cudaGetErrorString(error_code), file, line);
        fflush(stderr);
        exit(error_code);
    }
}

// takes about 2 seconds on A100 with iter=123456789
__global__ void long_kernel(int iter)
{
    printf("Kernel started\n");
    size_t result = 0;
    for(size_t i = 0; i < iter; i++)
    {
        result ^= i * result + (i - 3) * i;
    }
    if(result == 0) printf("Result is zero\n");
    printf("Kernel finished\n");
}

int main(int argc, const char ** argv)
{
    size_t alloc_size_1 = size_t{25} << 30;
    size_t alloc_size_2 = size_t{30} << 30;

    if(argc > 1) alloc_size_1 = atoll(argv[1]) << 30;
    if(argc > 2) alloc_size_2 = atoll(argv[2]) << 30;

    void * ptr_1 = nullptr;
    void * ptr_2 = nullptr;

    printf("Starting program\n");

    printf("Attempting to allocate %zu GiB\n", alloc_size_1 >> 30);
    CHECK(cudaMallocAsync(&ptr_1, alloc_size_1, 0));
    printf("Allocation successful: %p\n", ptr_1);
    long_kernel<<<1,1>>>(123456789);
    CHECK(cudaFreeAsync(ptr_1, 0));

    printf("Attempting to allocate %zu GiB\n", alloc_size_2 >> 30);
    CHECK(cudaMallocAsync(&ptr_2, alloc_size_2, 0));
    printf("Allocation successful: %p\n", ptr_2);
    long_kernel<<<1,1>>>(123456789);
    CHECK(cudaFreeAsync(ptr_2, 0));

    printf("Waiting to finish\n");
    CHECK(cudaDeviceSynchronize());
    printf("Program finished\n");

    return 0;
}

Compile e.g. with nvcc source.cu -o program.x (note the .cu extension; nvcc treats .cpp files as plain C++, so the kernel would not compile without it or the -x cu flag). Run e.g. with ./program.x 25 30 (the arguments are the sizes of the first and second allocations, in GiB).

I ran it on a 40-GiB GPU (A100), and the second allocation fails (cudaError=2, out of memory). This indicates that the freed memory was not actually reused.

In general, if the second allocation needs at most as much memory as the first, the program works fine. But if the second allocation is larger than the first, the program works only if the sum of the two allocations fits into the memory capacity. So on a 40-GiB GPU, ./program.x 35 30, ./program.x 19 20 and ./program.x 21 20 work fine, but ./program.x 20 21 results in out-of-memory.

CUDA 13.0.0, driver 580.65.06

The programming guide repeatedly talks about memory reuse, specifically here:

In order to service an allocation request, the driver attempts to reuse memory that was previously freed via cudaFreeAsync() before attempting to allocate more memory from the OS. For example, memory freed in a stream can be reused immediately in a subsequent allocation request on the same stream.

But I don’t observe this memory-reuse behavior. So my main question is: is this behavior expected? What is going on, and how does the memory reuse actually work?

I found a post referencing a similar issue, but I did not really understand the “solution” there, and it contains phrases like “I guess” and “it seems”, so there is no definitive answer. It does suggest, though, that this may be expected behavior.

Anyway, from the point of view of someone new to cudaMallocAsync, this can be very unexpected, and this behavior should at least be explicitly mentioned in the programming guide and/or the documentation.

Thanks,

Jakub


edit:

From the cudaMallocAsync blog post:

If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space.

This text suggests that what I observe is actually a bug, not expected behavior.


another edit:

So I played with it for a while, and the issue actually seems to be with how the memory pool is (or is not) extended, as was mentioned in the post linked above. Per my observation and the response in that post, the pool cannot extend itself in place: if there are 10 GB of free memory remaining in the pool but an allocation requires 12 GB, the pool is extended by another 12 GB, instead of only 2 GB.

Per my observation, once the pool has grown, e.g. from 10 GB by 20 GB to 30 GB, a subsequent 30 GB allocation succeeds without an issue, so the pool does not seem to be fragmented in that case; it just grows in a surprising way.

And as a workaround, querying how much free memory is available and alloc-free-ing 99 % of that amount at the start of the program fixes the strange behavior.
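The warm-up workaround could look something like this at the start of main() (a sketch using the CHECK macro from the program above; the 99 % fraction is just what worked for me):

```cpp
// Sketch of the warm-up workaround: grow the pool once at program start
// by allocating and immediately freeing ~99 % of the free device memory.
size_t free_bytes = 0, total_bytes = 0;
CHECK(cudaMemGetInfo(&free_bytes, &total_bytes));

size_t warmup_size = free_bytes / 100 * 99;
void * warmup_ptr = nullptr;
CHECK(cudaMallocAsync(&warmup_ptr, warmup_size, 0));
CHECK(cudaFreeAsync(warmup_ptr, 0));
CHECK(cudaStreamSynchronize(0));
// From here on, allocations up to ~warmup_size can be served from the pool.
```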

Another observation (on a 40GB GPU):

alloc(30) to create a 30-GB mempool
free(30)
alloc(10), will be placed at the start of the pool
alloc(15), will be placed in the middle of the pool
free(10)
current pool (virtual addresses): | 10 GB free | 15 GB occupied | 5 GB free |
alloc(12), fails, even though there are 15 GB free in the pool and 25 GB free in total

So fragmentation is actually an issue after all. This seems to contradict what I cited from the cudaMallocAsync blog post.
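The sequence above, written out as code (a sketch using the CHECK macro from the program at the top; it assumes a ~40 GB GPU, and only the final allocation's status is printed instead of aborting):

```cpp
// Sketch reproducing the fragmentation scenario described above.
void * a = nullptr, * b = nullptr, * c = nullptr, * d = nullptr;

CHECK(cudaMallocAsync(&a, size_t{30} << 30, 0)); // pool grows to 30 GB
CHECK(cudaFreeAsync(a, 0));
CHECK(cudaMallocAsync(&b, size_t{10} << 30, 0)); // placed at the start of the pool
CHECK(cudaMallocAsync(&c, size_t{15} << 30, 0)); // placed in the middle of the pool
CHECK(cudaFreeAsync(b, 0));
// pool layout now: | 10 GB free | 15 GB occupied | 5 GB free |
cudaError_t status = cudaMallocAsync(&d, size_t{12} << 30, 0);
printf("alloc(12): %s\n", cudaGetErrorString(status)); // fails with out-of-memory
```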

My main issue here is that, from the programming guide and the blog post, cudaMallocAsync looks like a silver bullet that solves all your problems, but the drawbacks I just experienced are not mentioned anywhere.

Memory reuse was the main thing I needed from cudaMallocAsync, and it failed me badly.


Disclaimer: I replied in the linked post, and my comments about memory pool internals are still just guesswork.

Another work-around for this behavior is to synchronize the stream after deallocation.

This probably guarantees to the pool that the physical memory is no longer accessed by the GPU and could safely be remapped if necessary.
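Applied to the program above, that would mean inserting a synchronization between each free and the next allocation (a sketch; it does of course give up some of the asynchrony):

```cpp
CHECK(cudaFreeAsync(ptr_1, 0));
CHECK(cudaStreamSynchronize(0)); // added: kernel is done, freed memory can safely be remapped

printf("Attempting to allocate %zu GiB\n", alloc_size_2 >> 30);
CHECK(cudaMallocAsync(&ptr_2, alloc_size_2, 0));
```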