I am working on a CUDA-aware MPI Fortran program. It has three subroutines that are each called in a loop roughly ten thousand times. In each of them I allocate and deallocate device memory, and I also transfer data between nodes with MPI_SENDRECV.
I am running on two Tesla P100 GPUs under Ubuntu. That is the background.
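To give an idea of the structure, each subroutine does something roughly like the sketch below (a minimal illustration, not my real code; the array names, element type, message size, and peer rank are made up):

```fortran
! Hypothetical per-cycle pattern: allocate device buffers, exchange them with
! a peer rank through CUDA-aware MPI_SENDRECV, then deallocate them again.
subroutine exchange_step(n, peer)
  use cudafor
  use mpi
  implicit none
  integer, intent(in) :: n, peer
  real(8), device, allocatable :: d_send(:), d_recv(:)
  integer :: istat, ierr
  integer :: mpistat(MPI_STATUS_SIZE)

  allocate(d_send(n), d_recv(n), stat=istat)
  if (istat /= 0) print *, 'allocate failed, stat =', istat

  ! ... kernels fill d_send on the device ...

  ! Device buffers are passed directly to MPI (CUDA-aware MPI).
  call MPI_SENDRECV(d_send, n, MPI_DOUBLE_PRECISION, peer, 0, &
                    d_recv, n, MPI_DOUBLE_PRECISION, peer, 0, &
                    MPI_COMM_WORLD, mpistat, ierr)

  deallocate(d_send, d_recv, stat=istat)
  if (istat /= 0) print *, 'deallocate failed, stat =', istat
end subroutine exchange_step
```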
There is a memory leak in my program. I am sure that every dynamic allocation has a matching deallocate. However, when I check with cudaMemGetInfo, I find that some of the deallocates do not actually free the memory: for example, I deallocate 10 arrays but only 9 are released. If I run the program on a single node, or if I run only one of the three subroutines, there is no leak.
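This is roughly how I check for the leak (a minimal, self-contained sketch; the buffer size and loop count are placeholders, and the allocate/deallocate pair stands in for one cycle of the suspect subroutine): I compare the free device memory reported by cudaMemGetInfo before and after each cycle, and I check the stat= result of every deallocate instead of assuming it succeeded.

```fortran
program leak_check
  use cudafor
  implicit none
  integer(kind=cuda_count_kind) :: free_before, free_after, total
  real(8), device, allocatable :: d_tmp(:)
  integer :: istat, iter

  do iter = 1, 5
     istat = cudaMemGetInfo(free_before, total)

     ! Stand-in for one cycle of the suspect subroutine: allocate, use, deallocate.
     allocate(d_tmp(1000000), stat=istat)
     if (istat /= 0) print *, 'allocate failed, stat =', istat
     deallocate(d_tmp, stat=istat)
     if (istat /= 0) print *, 'deallocate failed, stat =', istat

     istat = cudaDeviceSynchronize()
     istat = cudaMemGetInfo(free_after, total)
     print *, 'iter', iter, 'free-memory change (bytes):', free_before - free_after
  end do
end program leak_check
```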
I searched online, and someone suggested that this happens because, when MPI_SENDRECV transfers GPU memory directly, Unified Memory registers (keeps a record of) that memory.
By the way, I tried cudaDeviceReset, but it failed.
I do not know how to solve this bug. Do you have any suggestions?