I have a program that uses the CUPTI activity APIs to collect page fault information during execution and adjusts how it runs based on that information. I have observed a memory leak. Host memory usage keeps growing, and when host memory is exhausted the program hits a CUDA runtime error and fails. However, when I remove the CUPTI activity API calls so that page faults are not collected, the program leaks much less memory and completes its work.
So I am wondering whether the CUPTI activity APIs have a memory leak. Besides the activity APIs, the program also uses the CUPTI callback APIs. I tried calling cuptiFinalize(), but it causes runtime errors and the program crashes. My question is: once I no longer need the CUPTI activity APIs, how can I avoid the memory leak, or how can I release the resources they occupy? Thank you.
Unfortunately, the cuptiFinalize() API doesn’t work as expected in many cases. In its current form, this API requires the user to ensure that all work submitted to the GPU has completed, by calling the appropriate CUDA synchronization API(s), and that no application thread submits any more work to the GPU until the API returns. We have identified and fixed issues in this API and hope to deliver a robust implementation in the next CUDA release that will not require any explicit user synchronization.
Do you disable the activity kinds with cuptiActivityDisable() once you are done? Does that help?
We have been working on fixing memory leaks in CUPTI. Please provide more details about which CUPTI APIs you use and which activity kinds are enabled. It would be better if you filed a bug with more details, such as the CUDA toolkit version, GPU, and OS. In general, a small reproducer helps resolve the issue faster.
Thank you for your reply.
To reduce the overhead of collecting runtime activities through the CUPTI activity API, my program does not keep it enabled all the time. Instead, it enables the activity API with cuptiActivityEnable() for a short period and then disables it with cuptiActivityDisable(). Whenever the program needs to collect runtime behavior again, it re-enables the activity API and later disables it, so the enable/disable cycle repeats many times. Over the whole run, host memory usage keeps increasing until host memory is exhausted. I believe cuptiActivityDisable() stops the activity API from acquiring more host memory, but perhaps not all of the occupied resources are released. I will try to provide more details about my program later.
To solve the memory leak in my program, I am very interested in your remark that “cuptiFinalize() doesn’t work as expected in many cases”, and would like to know how to make cuptiFinalize() work as expected. I tried calling cudaDeviceSynchronize() before calling cuptiFinalize(), but the result is the same. What should I do before calling cuptiFinalize() to make it work as expected? Thank you.
Thank you for describing the workflow. We look forward to more details from you about the memory leak. You can open a bug at https://developer.nvidia.com/nvidia_bug/add and add the details there.
Regarding the crash in cuptiFinalize(): it turned out to be a known issue, and a fix should be available in the next CUDA release.