Hi @SanghoYeo,
I would like to understand what the major concern related to the profiling session is:
Profiling overhead
OR
Memory leak
I believe the reason you are using cuptiFinalize() is to reduce profiling overhead, as per your initial comment.
I’m actively trying to figure out the leaks and fix them, but in the meantime I’d like to check with you whether your use case actually requires cuptiFinalize() or whether there are other ways to support it.
I think the workaround you tried, i.e. disabling the CUPTI activities, should help you reduce the profiling overhead.
One more thing: along with disabling the CUPTI activities, you should also disable the CUPTI callbacks.
If you are using the cupti_finalize sample, the sample subscribes to all CUDA Driver and Runtime API callbacks.
Those should be disabled by calling the two APIs below, along with disabling the CUPTI activities.
So to end the profiler session you could do something like this:
cuptiActivityDisable(<activity kind>); … // one call per activity kind you enabled
cuptiEnableDomain(0, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RUNTIME_API);
cuptiEnableDomain(0, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_DRIVER_API);
And when starting the next profiling session, you can re-enable the CUPTI activities and CUPTI callbacks (if needed).
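To make that concrete, here is a rough sketch (not the exact sample code) of what the start/stop of such a session could look like; the helper names, the minimal injectionGlobals stand-in, and the single CONCURRENT_KERNEL activity kind are placeholders for illustration, and error checking is omitted:

```c
#include <cupti.h>

// Minimal stand-in for the sample's globals; the real cupti_finalize sample
// defines more fields and fills subscriberHandle via cuptiSubscribe().
static struct { CUpti_SubscriberHandle subscriberHandle; } injectionGlobals;

static void StartProfilingSession(void)
{
    // Enable the activity kinds you want to collect (CONCURRENT_KERNEL as an example).
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

    // Re-enable the callback domains only if you actually need callbacks.
    cuptiEnableDomain(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RUNTIME_API);
    cuptiEnableDomain(1, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_DRIVER_API);
}

static void StopProfilingSession(void)
{
    // Flush buffered activity records before turning collection off.
    cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);

    // Disable each activity kind that was enabled above.
    cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

    // Disable the driver and runtime API callback domains.
    cuptiEnableDomain(0, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_RUNTIME_API);
    cuptiEnableDomain(0, injectionGlobals.subscriberHandle, CUPTI_CB_DOMAIN_DRIVER_API);
}
```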
I’m not really sure whether you need CUPTI callbacks (you can let me know), but if your end goal is just to get activity records, then subscribing to callbacks is not necessary at all.
I think this should help reduce the profiling overhead during phases of your application where you do not want CUPTI to profile anything.
That was just some generic information on reducing profiling overhead with CUPTI and the sample you are using.
By default, CUPTI allocates 9 MB of device memory per context to provide timestamps for GPU-related activities, and it will allocate more buffers if required.
CUPTI also optimizes by reusing device buffers where possible rather than allocating a new device buffer every time one is needed.
So if you just disable the CUPTI activities, CUPTI will not free the device buffers that were allocated and will reuse the same buffers in your next profiling session when you re-enable the activities.
With cuptiFinalize(), CUPTI will free the device buffers and allocate new ones once CUPTI is attached again.
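In other words, only a full detach frees that device memory. A minimal sketch of such a detach, assuming it is called from a point where it is safe to do so (not from within a CUPTI callback; the cupti_finalize sample uses a separate thread for this), with error checking omitted:

```c
#include <cupti.h>

static void DetachCupti(void)
{
    // Flush any remaining activity records before tearing everything down.
    cuptiActivityFlushAll(CUPTI_ACTIVITY_FLAG_FLUSH_FORCED);

    // Free CUPTI resources, including the device buffers; new buffers are
    // allocated once CUPTI is attached again.
    cuptiFinalize();
}
```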
I’ll try to answer some of your queries now:
> I found that setting CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to a very low value (below 128KB) prevents the memory leak. However, this also results in the loss of information collected through Activity, making this method impractical.
From my analysis and debugging, the device buffers should not be causing any memory leak, as we make sure to free those buffers when cuptiFinalize() is called. Something else is causing the memory leak.
Regarding setting the device buffer size to 128 KB: CUPTI has a limit on how many device buffers can be allocated, and that limit is set to 250. By default, CUPTI allocates 3 buffers of 3 MB each, i.e. 9 MB in total.
So I think 128 KB is too small a buffer size; CUPTI might then try to allocate more than 250 buffers to store the required data, causing an out-of-memory sort of situation.
The limit can also be changed by setting the attribute CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE to the value you want (the default being 250).
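For reference, a sketch of how these attributes could be adjusted through cuptiActivitySetAttribute(); the values shown are only illustrative and should be set before the activities are enabled:

```c
#include <cupti.h>

// Sketch only: error checking omitted.
static void ConfigureActivityBuffers(void)
{
    size_t attrValueSize = sizeof(size_t);

    // Per-buffer device buffer size (the default is 3 MB).
    size_t deviceBufferSize = 3 * 1024 * 1024;
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                              &attrValueSize, &deviceBufferSize);

    // Limit on the number of device buffers (the default mentioned above is 250).
    size_t deviceBufferLimit = 250;
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE,
                              &attrValueSize, &deviceBufferLimit);
}
```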
> When collecting only CONCURRENT_KERNEL activity, the profiling overhead was relatively high with low batch sizes due to low computational complexity per kernel, but significantly lower in the opposite scenario. With larger batch sizes, the overhead was less than 1% for certain models, and even when the overhead was higher, it was around 4%. Additionally, while continuously maintaining the profiling session, there were moments when CPU memory usage suddenly spiked during execution, but it returned to previous levels and the execution remained stable.
Could I know which CUPTI version you are using?
An attribute, CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER, was added to CUPTI starting with the CUDA 12.3 Toolkit.
So if you are using CUPTI from CUDA 12.3 or later, you can set this attribute to 1 (the default is 0).
Through internal testing with benchmarks like Gromacs, we have noticed a decent amount of overhead reduction.
So I’m hoping this would also reduce the overhead a bit for you, especially if there are multiple threads in your application doing CUDA work.
There is some improvement with a single thread as well, but not as large as for a multi-threaded application.
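A sketch of enabling it, assuming CUPTI from CUDA 12.3 or later; the uint8_t value type used here is an assumption on my part, so please check the CUpti_ActivityAttribute documentation for the exact type:

```c
#include <stdint.h>
#include <cupti.h>

// Sketch only: set this before enabling CUPTI activities; error checking omitted.
static void EnablePerThreadActivityBuffers(void)
{
    // 1 switches CUPTI to per-thread activity buffers (the default, 0, keeps the
    // shared buffer); the value type is assumed to be uint8_t here.
    uint8_t perThreadBuffer = 1;
    size_t attrValueSize = sizeof(perThreadBuffer);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER,
                              &attrValueSize, &perThreadBuffer);
}
```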
I’d like you to try the following two things and let me know if they help you out:
- Make sure you disable CUPTI activities and CUPTI callbacks at the end of the profiling session
- Set CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER to 1 if you have the CUPTI version with the attribute present.
I’m sharing a fork of the cupti_finalize sample with the changes I suggested, which should ideally be decent enough for your use case. I removed the subscription to callbacks, but if you need it, you can uncomment that piece of code.
Link: CUPTI Finalize Sample Fork - Google Docs
In the meantime, I’m trying to work on the GPU leaks.