I wrote a simple program that repeatedly calls nvshmemx_init_attr and nvshmem_finalize in a loop. While the program is running, nvidia-smi shows the GPU memory usage steadily climbing, by approximately 2MB per iteration, so I think there might be a memory leak somewhere. I'm using NVSHMEM 3.3.9.
========= COMPUTE-SANITIZER
[Cycle 1/5] DEVICE CUDA API 12080
NVSHMEM v3.3.9
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x31f662] in libcuda.so.1
========= Host Frame: nvshmemi_get_cucontext(nvshmemi_state_dec*) in init.cu:488 [0x1b20c] in demo
========= Host Frame: nvshmemi_common_init(nvshmemi_state_dec*) in init.cu:945 [0x1d981] in demo
========= Host Frame: nvshmemi_try_common_init(nvshmemi_state_dec*) in init.cu:1067 [0x1e826] in demo
========= Host Frame: nvshmemid_hostlib_init_attr in init.cu:1166 [0x1f230] in demo
========= Host Frame: nvshmemi_init_thread(int, int*, unsigned int, nvshmemx_init_attr_v1*, nvshmemi_version_t) in init_device.cu:142 [0x1cd8b6] in demo
========= Host Frame: nvshmemx_init_attr in nvshmemx_api.h:62 [0x18660] in demo
========= Host Frame: main in demo.cu:60 [0x18934] in demo
=========
Init OK, Finalize OK
[Cycle 2/5] DEVICE CUDA API 12080
Init OK, Finalize OK
[Cycle 3/5] DEVICE CUDA API 12080
Init OK, Finalize OK
[Cycle 4/5] DEVICE CUDA API 12080
Init OK, Finalize OK
[Cycle 5/5] DEVICE CUDA API 12080
Init OK, Finalize OK
Completed 5 cycles successfully
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 1 error
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
So compute-sanitizer is not observing any memory leak.
The `CUDA_ERROR_INVALID_CONTEXT` error probably has to do with the two-stage NVSHMEM init, where we defer GPU device initialization until the first time the device is actually used by NVSHMEM. If you are curious about the details of this behavior, you can look around line 500 of init.cu.
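As a standalone illustration (this is just a sketch of the driver API behavior, not the actual NVSHMEM code path): probing the current context with cuCtxGetDevice before any context has been made current returns error 201, which is what the sanitizer flags; presumably the init code treats this as "no device context yet" rather than as a fatal error.

#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUdevice dev;
    cuInit(0);                          /* driver is initialized, but no context is current yet */
    CUresult rc = cuCtxGetDevice(&dev); /* returns CUDA_ERROR_INVALID_CONTEXT (201) */
    printf("cuCtxGetDevice with no current context: %d\n", (int)rc);
    return 0;
}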
As to where the 2MiB of memory is going for each iteration, I’d guess that it’s being used either internally by the NVSHMEM runtime or by the CUDA runtime.
I have about 140GB of GPU memory in total. I ran the demo I wrote for quite a while, and by that point it was using roughly 60GB of GPU memory (it started at about 2GB and grew gradually). Then I used another program to cudaMalloc about 80GB of GPU memory, which together comes close to the maximum.
After a short period of time, my demo would fail with:
[xx/src/device/launch/collective_launch.cpp:173] cuda failed with out of memory
/src/device/init/init_device.cu:78: non-zero status: 2 _nvshmemi_init_device_only_state failed
xx/src/device/init/init_device.cu:150: non-zero status: 7 nvshmem_internal_init_thread failed at init_device_only_state.
xx/build/src/include/host/nvshmemx_api.h:63: non-zero status: 7: Resource temporarily unavailable, exiting... aborting due to error in nvshmemi_init_thread
Though compute-sanitizer can't find any leaks, it looks to me like there is one…
If you set NVSHMEM_DEBUG_SUBSYS=INIT,MEM and NVSHMEM_DEBUG=INFO, you should see prints everywhere NVSHMEM allocates and frees memory. You could inspect them to see whether they match the way you expect.
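For example, something like this when running your program (assuming the binary is the demo you shared):

NVSHMEM_DEBUG=INFO NVSHMEM_DEBUG_SUBSYS=INIT,MEM ./demo 5 100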
I checked the log for heap_allocate and heap_deallocate, and they match perfectly, so the symmetric heap part is correct. The official documentation says:
Multiple calls to nvshmem_init are allowed, and must be called by the same set of processes as the initial call to nvshmem_init.
Running my demo would eventually cause OOM, which does not align with the description above.
My demo is:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "nvshmem.h"
#include "nvshmemx.h"

int main(int argc, char **argv) {
    int cycles = -1; // -1 means run infinitely
    int delay_ms = 100;

    if (argc > 1) {
        cycles = atoi(argv[1]);
        if (cycles <= 0) cycles = -1;
    }
    if (argc > 2) {
        delay_ms = atoi(argv[2]);
        if (delay_ms < 0) delay_ms = 100;
    }

    if (cycles == -1) {
        printf("Running continuously (Ctrl+C to stop), Delay: %d ms\n\n", delay_ms);
    } else {
        printf("Cycles: %d, Delay: %d ms\n\n", cycles, delay_ms);
    }

    int cycle = 1;
    while (cycles == -1 || cycle <= cycles) {
        if (cycles == -1) {
            printf("[Cycle %d] ", cycle);
        } else {
            printf("[Cycle %d/%d] ", cycle, cycles);
        }

        nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
        nvshmemx_uniqueid_t unique_id;
        nvshmemx_get_uniqueid(&unique_id);
        nvshmemx_set_attr_uniqueid_args(0, 1, &unique_id, &attr);
        int status = nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);
        if (status != 0) {
            printf("FAILED - Init error %d\n", status);
            return 1;
        }
        printf("Init OK, ");

        nvshmem_finalize();
        printf("Finalize OK\n");

        if (delay_ms > 0) {
            usleep(delay_ms * 1000);
        }
        cycle++;
    }

    if (cycles != -1) {
        printf("\nCompleted %d cycles successfully\n", cycles);
    }
    return 0;
}
Maybe you could run it in your environment to see if that provides any insights into the issue? Thanks for your continued attention and help.
This is not the same thing as “any number of calls to nvshmem_init are guaranteed not to consume any hardware resources”.
When we do the two-stage init that I mentioned before, we track some state internally which allows the NVSHMEM runtime to remain bootstrapped but not initialized after finalize is called. We also keep some other bootstrap-related information around after a finalize. Additionally, CUDA itself may retain some state as well.
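If it helps to separate those pieces, you can query device memory from inside the loop, right after finalize; nvidia-smi also counts per-context overhead, so its view can differ from what is actually allocated. A rough sketch based on your demo (single PE, runtime API):

#include <stdio.h>
#include <cuda_runtime.h>
#include "nvshmem.h"
#include "nvshmemx.h"

int main(void) {
    size_t free_b, total_b;

    for (int cycle = 1; cycle <= 5; cycle++) {
        nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
        nvshmemx_uniqueid_t unique_id;
        nvshmemx_get_uniqueid(&unique_id);
        nvshmemx_set_attr_uniqueid_args(0, 1, &unique_id, &attr);
        if (nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr) != 0) return 1;

        nvshmem_finalize();

        /* Whatever is still allocated on the device at this point is state that
           survives finalize (NVSHMEM/bootstrap state, the CUDA context, etc.). */
        cudaMemGetInfo(&free_b, &total_b);
        printf("[cycle %d] device memory in use after finalize: %zu MiB\n",
               cycle, (size_t)((total_b - free_b) >> 20));
    }
    return 0;
}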
How many times does your use case actually need to initialize and finalize NVSHMEM? Typically, because initialization and finalization are relatively expensive, we see users initialize as few times as is practical.
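If the per-cycle work doesn't strictly need a fresh runtime each time, the usual pattern is to hoist init/finalize out of the loop, roughly like this (a sketch reusing the same single-PE unique-id bootstrap as your demo; the per-cycle body is just a placeholder):

#include <stdio.h>
#include "nvshmem.h"
#include "nvshmemx.h"

int main(void) {
    /* Initialize once up front. */
    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
    nvshmemx_uniqueid_t unique_id;
    nvshmemx_get_uniqueid(&unique_id);
    nvshmemx_set_attr_uniqueid_args(0, 1, &unique_id, &attr);
    if (nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr) != 0) return 1;

    for (int cycle = 1; cycle <= 1000; cycle++) {
        /* Per-cycle work goes here, reusing the same NVSHMEM state.
           As a placeholder, allocate and free a 1 MiB symmetric buffer. */
        void *buf = nvshmem_malloc(1 << 20);
        nvshmem_free(buf);
    }

    /* Finalize once, at the end of the run. */
    nvshmem_finalize();
    printf("Done\n");
    return 0;
}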