I wrote a simple program that repeatedly calls nvshmemx_init_attr and nvshmem_finalize in a loop. While the program is running, nvidia-smi shows the GPU memory usage steadily climbing, by approximately 2MB per iteration, so I think there might be a memory leak somewhere. I'm using NVSHMEM 3.3.9.
========= COMPUTE-SANITIZER
[Cycle 1/5] DEVICE CUDA API 12080
NVSHMEM v3.3.9
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x31f662] in libcuda.so.1
========= Host Frame: nvshmemi_get_cucontext(nvshmemi_state_dec*) in init.cu:488 [0x1b20c] in demo
========= Host Frame: nvshmemi_common_init(nvshmemi_state_dec*) in init.cu:945 [0x1d981] in demo
========= Host Frame: nvshmemi_try_common_init(nvshmemi_state_dec*) in init.cu:1067 [0x1e826] in demo
========= Host Frame: nvshmemid_hostlib_init_attr in init.cu:1166 [0x1f230] in demo
========= Host Frame: nvshmemi_init_thread(int, int*, unsigned int, nvshmemx_init_attr_v1*, nvshmemi_version_t) in init_device.cu:142 [0x1cd8b6] in demo
========= Host Frame: nvshmemx_init_attr in nvshmemx_api.h:62 [0x18660] in demo
========= Host Frame: main in demo.cu:60 [0x18934] in demo
=========
Init OK, Finalize OK
[Cycle 2/5] DEVICE CUDA API 12080
Init OK, Finalize OK
[Cycle 3/5] DEVICE CUDA API 12080
Init OK, Finalize OK
[Cycle 4/5] DEVICE CUDA API 12080
Init OK, Finalize OK
[Cycle 5/5] DEVICE CUDA API 12080
Init OK, Finalize OK
Completed 5 cycles successfully
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 1 error
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
So compute-sanitizer is not observing any memory leak.
The `CUDA_ERROR_INVALID_CONTEXT` error probably has to do with the two-stage NVSHMEM init, where we defer GPU device initialization until the first time the device is actually used by NVSHMEM. If you are curious about the details of this behavior, you can look around line 500 of init.cu.
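As a standalone illustration (this is just a sketch of the driver API behavior, not the actual NVSHMEM code path): probing the current context with cuCtxGetDevice before any context has been made current returns error 201, which is what the sanitizer flags; presumably the init code treats this as "no device context yet" rather than as a fatal error.

#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUdevice dev;
    cuInit(0);                          /* driver is initialized, but no context is current yet */
    CUresult rc = cuCtxGetDevice(&dev); /* returns CUDA_ERROR_INVALID_CONTEXT (201) */
    printf("cuCtxGetDevice with no current context: %d\n", (int)rc);
    return 0;
}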
As to where the 2MiB of memory is going for each iteration, I’d guess that it’s being used either internally by the NVSHMEM runtime or by the CUDA runtime.
I have about 140GB of GPU memory in total. I ran the demo I wrote for quite a while, and by that point it was using roughly 60GB of GPU memory (it started at about 2GB and grew gradually). Then I used another program to cudaMalloc about 80GB of GPU memory, which together comes close to the maximum.
After a short period of time, my demo would fail with:
[xx/src/device/launch/collective_launch.cpp:173] cuda failed with out of memory
/src/device/init/init_device.cu:78: non-zero status: 2 _nvshmemi_init_device_only_state failed
xx/src/device/init/init_device.cu:150: non-zero status: 7 nvshmem_internal_init_thread failed at init_device_only_state.
xx/build/src/include/host/nvshmemx_api.h:63: non-zero status: 7: Resource temporarily unavailable, exiting... aborting due to error in nvshmemi_init_thread
Though compute-sanitizer can't find any leaks, it looks to me like there is one…
If you set NVSHMEM_DEBUG_SUBSYS=INIT,MEM and NVSHMEM_DEBUG=INFO, you should see prints everywhere NVSHMEM allocates and frees memory. You could inspect them to see whether they match the way you expect.
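For example, something like this when running your program (assuming the binary is the demo you shared):

NVSHMEM_DEBUG=INFO NVSHMEM_DEBUG_SUBSYS=INIT,MEM ./demo 5 100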
I checked the log for heap_allocate and heap_deallocate, and they match perfectly, so the symmetric heap part is correct. The official documentation says:
Multiple calls to nvshmem_init are allowed, and must be called by the same set of processes as the initial call to nvshmem_init.
Running my demo would eventually cause OOM, which does not align with the description above.
My demo is:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "nvshmem.h"
#include "nvshmemx.h"

int main(int argc, char **argv) {
    int cycles = -1; // -1 means run infinitely
    int delay_ms = 100;

    if (argc > 1) {
        cycles = atoi(argv[1]);
        if (cycles <= 0) cycles = -1;
    }
    if (argc > 2) {
        delay_ms = atoi(argv[2]);
        if (delay_ms < 0) delay_ms = 100;
    }

    if (cycles == -1) {
        printf("Running continuously (Ctrl+C to stop), Delay: %d ms\n\n", delay_ms);
    } else {
        printf("Cycles: %d, Delay: %d ms\n\n", cycles, delay_ms);
    }

    int cycle = 1;
    while (cycles == -1 || cycle <= cycles) {
        if (cycles == -1) {
            printf("[Cycle %d] ", cycle);
        } else {
            printf("[Cycle %d/%d] ", cycle, cycles);
        }

        nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
        nvshmemx_uniqueid_t unique_id;
        nvshmemx_get_uniqueid(&unique_id);
        nvshmemx_set_attr_uniqueid_args(0, 1, &unique_id, &attr);
        int status = nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);
        if (status != 0) {
            printf("FAILED - Init error %d\n", status);
            return 1;
        }
        printf("Init OK, ");

        nvshmem_finalize();
        printf("Finalize OK\n");

        if (delay_ms > 0) {
            usleep(delay_ms * 1000);
        }
        cycle++;
    }

    if (cycles != -1) {
        printf("\nCompleted %d cycles successfully\n", cycles);
    }
    return 0;
}
Maybe you could run it in your environment to see if that provides any insights into the issue? Thanks for your continued attention and help.
This is not the same thing as “any number of calls to nvshmem_init are guaranteed not to consume any hardware resources”.
When we do the two-stage init that I mentioned before, we track some state internally which allows the NVSHMEM runtime to remain bootstrapped but not initialized after finalize is called. We also keep some other bootstrap-related information around after a finalize. Additionally, CUDA itself may retain some state as well.
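If it helps to separate those pieces, you can query device memory from inside the loop, right after finalize; nvidia-smi also counts per-context overhead, so its view can differ from what is actually allocated. A rough sketch based on your demo (single PE, runtime API):

#include <stdio.h>
#include <cuda_runtime.h>
#include "nvshmem.h"
#include "nvshmemx.h"

int main(void) {
    size_t free_b, total_b;

    for (int cycle = 1; cycle <= 5; cycle++) {
        nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
        nvshmemx_uniqueid_t unique_id;
        nvshmemx_get_uniqueid(&unique_id);
        nvshmemx_set_attr_uniqueid_args(0, 1, &unique_id, &attr);
        if (nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr) != 0) return 1;

        nvshmem_finalize();

        /* Whatever is still allocated on the device at this point is state that
           survives finalize (NVSHMEM/bootstrap state, the CUDA context, etc.). */
        cudaMemGetInfo(&free_b, &total_b);
        printf("[cycle %d] device memory in use after finalize: %zu MiB\n",
               cycle, (size_t)((total_b - free_b) >> 20));
    }
    return 0;
}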
How many times does your use case actually need to initialize and finalize NVSHMEM? Typically, because initialization and finalization are relatively expensive, we see users initialize as few times as is practical.
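If the per-cycle work doesn't strictly need a fresh runtime each time, the usual pattern is to hoist init/finalize out of the loop, roughly like this (a sketch reusing the same single-PE unique-id bootstrap as your demo; the per-cycle body is just a placeholder):

#include <stdio.h>
#include "nvshmem.h"
#include "nvshmemx.h"

int main(void) {
    /* Initialize once up front. */
    nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
    nvshmemx_uniqueid_t unique_id;
    nvshmemx_get_uniqueid(&unique_id);
    nvshmemx_set_attr_uniqueid_args(0, 1, &unique_id, &attr);
    if (nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr) != 0) return 1;

    for (int cycle = 1; cycle <= 1000; cycle++) {
        /* Per-cycle work goes here, reusing the same NVSHMEM state.
           As a placeholder, allocate and free a 1 MiB symmetric buffer. */
        void *buf = nvshmem_malloc(1 << 20);
        nvshmem_free(buf);
    }

    /* Finalize once, at the end of the run. */
    nvshmem_finalize();
    printf("Done\n");
    return 0;
}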