Memory leak at the termination of a CUPTI profiling session

Hello, I have been trying to reduce profiling overhead by dynamically attaching and detaching CUPTI using cuptiFinalize(). Based on the cupti_finalize sample code, I have confirmed that this approach allows stable profiling sessions without segmentation faults. However, I have observed memory leaks on both the CPU and GPU sides with this code. This makes it difficult to use CUPTI for real-time profiling in actual work environments, because it eventually exhausts both CPU and GPU memory.

You can reproduce this issue with just the cupti_finalize sample provided in /usr/local/cuda/extras/CUPTI. For instance, the change in memory usage after a few minutes can be seen in the two images below.

[At start point of process]

[After some minutes]

Could you please check this issue?

Here is my progress report.

I was unable to resolve the memory leak that occurs when ending a profiling session. Therefore, it is not possible to reduce profiling overhead by terminating the profiling session.

To work around this issue, I tested disabling activity collection via cuptiActivityDisable without ending the profiling session. With this method, I found that setting CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to a very low value (below 128KB) prevents the memory leak. However, this also results in the loss of information collected through Activity, making this method impractical.

Finally, to conclude this task, I examined the profiling overhead during training on various GPU devices in our company. When collecting only CONCURRENT_KERNEL activity, the profiling overhead was relatively high at small batch sizes, where the computational work per kernel is low, but significantly lower at larger batch sizes: less than 1% for certain models, and around 4% even in the worst cases. Additionally, while continuously maintaining the profiling session, there were moments when CPU memory usage suddenly spiked during execution, but it returned to previous levels and execution remained stable.

Hi, @SanghoYeo

Thanks for the detailed info!
Our CUPTI dev is investigating this issue. We will let you know once there is an update.

Hello @veraj,
I was wondering if there has been any progress on this issue?

Hi @SanghoYeo,

I would like to understand what the major concern related to the profiling session is:
Profiling overhead
OR
Memory leak

As per your initial comment, I believe the reason for using cuptiFinalize() is to reduce profiling overhead.

I'm actively trying to figure out the leaks and fix them, but in the meantime I'd like to check with you whether your use case really requires cuptiFinalize() or whether there are other ways to support it.

I think the workaround you tried, i.e. disabling the CUPTI activities, should help you reduce the profiling overhead.
One more thing: along with disabling the CUPTI activities, you should also disable the CUPTI callbacks.

If you are using the cupti_finalize sample, note that it subscribes to all CUDA Driver and Runtime API callbacks.
Those should also be disabled by calling the two APIs below, along with disabling the CUPTI activities.

So to end the profiler session you could do something like this:
cuptiActivityDisable(); …
cuptiEnableDomain(0, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RUNTIME_API);
cuptiEnableDomain(0, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_DRIVER_API);

And when starting the profiling session, you can enable the CUPTI activities and CUPTI callbacks (if needed).
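
For illustration, here is a minimal sketch of such a stop/start pair. It assumes a subscriber handle like the one in the cupti_finalize sample (injectionGlobals.subscriberHandle) and that only the CONCURRENT_KERNEL activity kind is collected; adapt it to the activity kinds you actually enable.

    #include <cupti.h>

    // Sketch only: error checking of the CUptiResult return values is omitted.
    static void StopProfilingPhase(CUpti_SubscriberHandle subscriber)
    {
        // Stop collecting activity records and flush what is already buffered.
        cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
        cuptiActivityFlushAll(0);

        // Disable the driver/runtime callback domains the sample subscribed to.
        cuptiEnableDomain(0, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
        cuptiEnableDomain(0, subscriber, CUPTI_CB_DOMAIN_DRIVER_API);
    }

    static void StartProfilingPhase(CUpti_SubscriberHandle subscriber)
    {
        // Re-enable activity collection and, only if actually needed, the callbacks.
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
        cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
        cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API);
    }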

I'm not really sure you need CUPTI callbacks; you can let me know whether you do. But if your end goal is to get activity records, then subscribing to callbacks is not necessary at all.

I think this should help reduce the profiling overhead during phases of your application where you do not want CUPTI to profile anything.

That was just general guidance I wanted to give on reducing the profiling overhead with CUPTI and the sample you are using.

By default, CUPTI allocates 9 MB of device memory per context to provide timestamps for GPU-related activities, and it will allocate more buffers if required.
CUPTI also optimizes by reusing the device buffers where possible rather than allocating a new device buffer every time one is required.

So if you just disable the CUPTI activities, CUPTI will not free the device buffers that are allocated and will reuse the same buffers in your next profiling session when you enable the activities again.

With cuptiFinalize(), CUPTI frees the device buffers and allocates new ones after you attach CUPTI again.

I'll try to answer some of your queries now:

I found that setting CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to a very low value (below 128KB) prevents the memory leak. However, this also results in the loss of information collected through Activity, making this method impractical.

From my analysis and debugging, the device buffers should not be causing any memory leak, as we make sure to free the buffers when cuptiFinalize() is called. Something else is causing the memory leak.
Regarding setting the device buffer size to 128 KB: there is a limit in CUPTI on how many buffers can be allocated, and it is set to 250. By default CUPTI allocates 3 buffers of 3 MB each, i.e. 9 MB in total.
So I think 128 KB is too small a buffer size; CUPTI might then be trying to allocate more than 250 buffers to store the required data, causing an out-of-memory sort of situation.
The limit can also be changed by setting the attribute CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE to the value you want (the default being 250).
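
For reference, here is a minimal sketch of how these attributes can be adjusted through cuptiActivitySetAttribute(). The values below (1 MB buffer size, limit of 250) are only examples, and the size_t value type is an assumption to verify against the CUPTI headers of your toolkit.

    #include <cupti.h>
    #include <cstdio>

    static void ConfigureActivityBuffers()
    {
        size_t attrValueSize = sizeof(size_t);

        // Example: keep the per-buffer size at 1 MB instead of shrinking it below 128 KB.
        size_t deviceBufferSize = 1024 * 1024;
        if (cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                                      &attrValueSize, &deviceBufferSize) != CUPTI_SUCCESS)
            fprintf(stderr, "Failed to set CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE\n");

        // Example: explicitly set the pre-allocation limit (250 is the default mentioned above).
        size_t preAllocateValue = 250;
        if (cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE,
                                      &attrValueSize, &preAllocateValue) != CUPTI_SUCCESS)
            fprintf(stderr, "Failed to set CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE\n");
    }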

When collecting only CONCURRENT_KERNEL activity, the profiling overhead was relatively high at small batch sizes, where the computational work per kernel is low, but significantly lower at larger batch sizes: less than 1% for certain models, and around 4% even in the worst cases. Additionally, while continuously maintaining the profiling session, there were moments when CPU memory usage suddenly spiked during execution, but it returned to previous levels and execution remained stable.

Could you let me know which CUPTI version you are using?
CUPTI added a new attribute, CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER, starting with the CUDA 12.3 Toolkit.
So if you are using CUPTI from CUDA 12.3 or later, you can set this attribute to 1 (the default is 0).
Through internal testing with benchmarks like Gromacs, we have noticed a decent amount of overhead reduction.
So I'm hoping this would reduce the overhead a bit for you as well, if there are multiple threads in your application doing CUDA work.
With a single thread there is some improvement too, but not as much as for a multi-threaded application.

I'd like you to try the following two things and let me know if they help you out:

  1. Make sure you disable CUPTI activities and CUPTI callbacks at the end of the profiling session
  2. Set CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER to 1 if you have the CUPTI version with the attribute present.

I'm sharing a fork of the cupti_finalize sample with the changes I suggested; ideally it should be good enough for your use case. I removed the subscription to callbacks, but if you need it, you can uncomment that piece of code.
Link: CUPTI Finalize Sample Fork - Google Docs

In the meantime, I’m trying to work on the GPU leaks.

Hello @ambers

First, thank you for your answer.

As you mentioned, since my main purpose for using CUPTI is to collect CONCURRENT_KERNEL activity, I don't need to use callbacks. (The reason I used callbacks was to call cuptiFinalize() reliably.) Therefore, I have modified my code to collect activity periodically using cuptiActivityEnable and cuptiActivityDisable.

Here are some simple experimental results. I experimented with bloom560m with a batch size of 1, training for 4 epochs. For the dynamic profiling case, I repeated a cycle of ActivityEnable for 5 seconds followed by ActivityDisable for 55 seconds.

Case              | Epoch Time (sec) | Elapsed Time (sec)
Keep Profiling    | 50.65            | 227.36
Dynamic Profiling | 47.91            | 220.63
No Profiling      | 45.80            | 208.80

As you can see from the experiment above, disabling collection through cuptiActivityDisable reduces the load to some extent. However, merely keeping CUPTI's profiling session alive (i.e., not terminating it with cuptiFinalize()) still incurs some load even while activity collection is disabled via cuptiActivityDisable. In the end, the best solution still seems to be to bring the load down to zero via cuptiFinalize() in the sections where activity is not being collected.
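
For clarity, the periodic collection follows roughly the sketch below (simplified; the actual injection library also handles configuration via env vars and error checking):

    #include <cupti.h>
    #include <atomic>
    #include <chrono>
    #include <thread>

    static std::atomic<bool> gKeepRunning{true};

    // Background thread: 5 s of CONCURRENT_KERNEL collection, then 55 s idle.
    static void DynamicProfilingLoop()
    {
        using namespace std::chrono_literals;
        while (gKeepRunning.load())
        {
            cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
            std::this_thread::sleep_for(5s);

            cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
            cuptiActivityFlushAll(0);  // deliver buffered records to the buffer-completed callback
            std::this_thread::sleep_for(55s);
        }
    }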

PS 1. As you mentioned, I also activated the PER_THREAD_ACTIVITY_BUFFER option for testing, but there was no significant difference in performance load. Please check the table below.

Case                                    | Epoch Time (sec) | Elapsed Time (sec)
Keep Profiling                          | 50.65            | 227.36
Keep Profiling with PER THREAD ACTIVITY | 50.70            | 227.33

PS 2. I observed that after cuptiActivityDisable is first called in the thread, the load due to CUPTI becomes very large until collection is reactivated through cuptiActivityEnable. For example, in my experiment the first epoch falls into this interval, which increases its execution time compared to the baseline, as shown below. This problem can be bypassed by keeping the first cuptiActivityDisable interval short.

Case                                                     | First Epoch Time (sec) | Elapsed Time (sec)
Dynamic Profiling                                        | 48.29                  | 220.63
Dynamic Profiling w/o short first activity disable time | 62.51                  | 233.52

For now, I’ll be improving my program based on the ActivityDisable method you suggested :D
In the meantime, I hope you can resolve the memory leak issue with cuptiFinalize!

Thanks @SanghoYeo for your reply, and for backing it up with some numbers.
Good to know that performance overhead is the primary concern here, and that you have covered your requirement and got rid of the CUPTI callbacks.

I'm working on fixing the leaks. The GPU leaks are turning out to be tricky to identify.
I hope to fix the CPU leaks sooner, as I have already identified a few of them while working on this.

I think you missed answering which CUPTI version you are using. Could you let me know?

Out of curiosity, what are your expectations with regard to overhead?
As I see it, with dynamic profiling the overhead is around 5%.
I know 0% is the optimal dream, but what is a realistic/acceptable profiling overhead you'd expect?

This problem can be bypassed by keeping the first cuptiActivityDisable interval short.

I did not get this part.
Are you trying to disable the same activities in all threads, or are you just talking about one thread carrying out the disabling of the CUPTI activities?
CUPTI activities are enabled/disabled at the global level using the cuptiActivityEnable/Disable() APIs.

I experimented with bloom560m with a batch size of 1, training for 4 epochs.

https://huggingface.co/bigscience/bloom-560m/tree/main: Is this the one you are talking about?
What are the steps involved in running this?
It would help me get this resolved as well.
If possible, it would be great to get the injection library you are using to run CUPTI, along with the steps to set up bloom-560m and run it with the injection.

As you mentioned, I also activated the PER_THREAD_ACTIVITY_BUFFER option for testing, but there was no significant difference in performance load.

Interesting. Do you know how many threads there are in the benchmark you experimented with?

Thanks.

Hello @ambers,
I’m glad to hear that the CPU memory leak has been somewhat resolved! And thank you for checking my response! Here I’ve summarized the answers to your questions:

Q: I think you missed answering which CUPTI version you are using. Could you let me know?

A: I'll describe both the CUPTI version used for the build and the CUPTI runtime version in use when I discovered the issue.
For the build, I used libcupti.so.2023.3.1.
At runtime, I confirmed that the libcupti library being loaded is the one installed along with torch, and I checked the API version at runtime: the version reported by cuptiGetVersion is 18.
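
For reference, the runtime API version can be queried with a small check like this (sketch):

    #include <cupti.h>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        uint32_t apiVersion = 0;
        if (cuptiGetVersion(&apiVersion) == CUPTI_SUCCESS)
            printf("CUPTI API version: %u\n", apiVersion);  // reports 18 in my environment
        return 0;
    }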

Q: Out of curiosity, what are your expectations with regard to overhead?
As I see it, with dynamic profiling the overhead is around 5%.
I know 0% is the optimal dream, but what is a realistic/acceptable profiling overhead you'd expect?

A: I'm aiming to profile for about 5 seconds every 60 seconds, or, even if we extend the interval between profiling windows (up to 2 minutes?), to keep the profiling overhead within 1%.

Q: I did not get this part.
Are you trying to disable the same activities in all threads, or are you just talking about one thread carrying out the disabling of the CUPTI activities?
CUPTI activities are enabled/disabled at the global level using the cuptiActivityEnable/Disable() APIs.

A: Since I disable it using cuptiActivityDisable, it is a global-level deactivation (so presumably it applies to all threads).

Q: https://huggingface.co/bigscience/bloom-560m/tree/main: Is this the one you are talking about?
What are the steps involved in running this?
It would help me get this resolved as well.
If possible, it would be great to get the injection library you are using to run CUPTI, along with the steps to set up bloom-560m and run it with the injection.

A: The reason I used the bloom560m model is that it showed a relatively high performance overhead (8–10%) depending on whether profiling was enabled or not. However, since this model's code requires downloading the model and dataset, it might be inconvenient for testing. So I'm sharing a simple test code and the injection library I used in a repo.

Container image used: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel
Injection CUPTI library: samples/cupti_finalize at master · Yeosangho/samples · GitHub

Here are the env vars used in my CUPTI injection library (a simplified sketch of how they might be read follows the list):

  • ENABLE_DYNAMLC=1: perform dynamic activity collection.
  • FREQ_A: time at which activity collection is activated after the first activity disable.
  • FREQ_B: time at which activity collection is deactivated after the first activity disable.
  • ENABLE_PER_THREAD_ACTIVITY=1: activate the per-thread activity buffer.
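
As a rough illustration (not the exact code in the repo), the injection library reads these variables along these lines; the struct name and field defaults here are only placeholders:

    #include <cstdlib>
    #include <string>

    struct InjectionConfig
    {
        bool enableDynamic = false;      // ENABLE_DYNAMLC=1 -> dynamic activity collection
        int  freqA = 0;                  // FREQ_A: timing of re-enabling collection (see the list above)
        int  freqB = 0;                  // FREQ_B: timing of disabling collection (see the list above)
        bool perThreadActivity = false;  // ENABLE_PER_THREAD_ACTIVITY=1
    };

    static InjectionConfig ReadConfigFromEnv()
    {
        InjectionConfig cfg;
        if (const char *v = std::getenv("ENABLE_DYNAMLC"))
            cfg.enableDynamic = (std::string(v) == "1");
        if (const char *v = std::getenv("FREQ_A"))
            cfg.freqA = std::atoi(v);
        if (const char *v = std::getenv("FREQ_B"))
            cfg.freqB = std::atoi(v);
        if (const char *v = std::getenv("ENABLE_PER_THREAD_ACTIVITY"))
            cfg.perThreadActivity = (std::string(v) == "1");
        return cfg;
    }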

Test code: samples/pytorch-cifar10 at master · Yeosangho/samples · GitHub

  • Execution command: env {some env var setting for injection library} python main_wo_cudaprofiler.py

Performance change with test code

Case                               | Epoch Time | Performance difference vs. No Profiling
Profiling Enabled                  | 5.58       | 1.06
Profiling Disabled                 | 5.32       | 1.015
No Profiling (no injected library) | 5.24       | 1

Q: Interesting. Do you know how many threads there are in the benchmark you experimented with?

A: I used a PyTorch application. I'm not entirely sure which operations PyTorch runs multi-threaded.

Hello @ambers,
Could you check whether there has been any progress on this issue?