CUPTI memory overheads


I’m investigating CUPTI memory overheads and have noticed a memory usage increase when CUPTI is enabled (the Callback API, in my case). The memory usage also appears to scale with the number of CPU cores.

Part of the memory reservation happens during cuptiSubscribe().

Is there a way to avoid this memory reservation, or to specify the number of cores the profiled process runs on? My process is pinned to a subset of CPUs, so allocating buffers for every CPU core would be unnecessary for it.
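For reference, my setup is essentially the minimal Callback API pattern (a sketch; error handling is trimmed, and `my_callback` is a placeholder for my real callback):

```c
#include <cupti.h>
#include <stdio.h>

/* Placeholder callback; the real one records per-API data. */
static void CUPTIAPI my_callback(void *userdata, CUpti_CallbackDomain domain,
                                 CUpti_CallbackId cbid, const void *cbdata) {
    (void)userdata; (void)domain; (void)cbid; (void)cbdata;
}

int main(void) {
    CUpti_SubscriberHandle subscriber;

    /* The large allocation shows up inside this call. */
    CUptiResult res = cuptiSubscribe(&subscriber,
                                     (CUpti_CallbackFunc)my_callback, NULL);
    if (res != CUPTI_SUCCESS) {
        fprintf(stderr, "cuptiSubscribe failed: %d\n", (int)res);
        return 1;
    }

    /* ... run the workload; note that no domain has been enabled yet ... */

    /* The reserved memory is released again here. */
    cuptiUnsubscribe(subscriber);
    return 0;
}
```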


Hi Ming,

CUPTI's memory footprint should not scale with the number of CPU cores. Does your application allocate resources (such as CUDA contexts or CUDA module loads) based on the number of CPU cores? Based on your experiments, how much memory does CUPTI use per CPU core?

For tracing, CUPTI memory overhead is documented at CUPTI :: CUPTI Documentation. That section is specific to device memory usage: CUPTI allocates certain resources in device and/or pinned host memory for each CUDA context.

Hi mjain! Thanks for your reply!

I observed memory overhead much higher than the 3 × 3 MB per context mentioned in the documentation. According to a heap profile of my application, cuptiSubscribe allocated 1.3 GB on an 8-core machine, even though I had not enabled any callback domain. A flame graph showing the call stack is attached. I also verified that this 1.3 GB is freed during cuptiUnsubscribe, so it is managed memory rather than a leak.

In case a third-party library was allocating many CUDA contexts, I tried setting CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 bytes and CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED to 0, but I still saw the same 1.3 GB overhead.
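Concretely, the attribute settings I tried look roughly like this (a sketch using cuptiActivitySetAttribute; I am assuming uint8_t is the right value type for the host-pinned attribute):

```c
#include <cupti.h>
#include <stdint.h>

/* Shrink CUPTI's per-context activity buffers as far as possible.
   (This did not change the 1.3 GB figure in my case.) */
static void shrink_cupti_buffers(void) {
    size_t size;

    size_t bufferSize = 1024;  /* 1 KB activity buffers */
    size = sizeof(bufferSize);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                              &size, &bufferSize);

    uint8_t hostPinned = 0;    /* do not use pinned host memory */
    size = sizeof(hostPinned);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED,
                              &size, &hostPinned);
}
```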

Q: Any insight into what this memory is allocated for, and whether there is a way to reduce it? Thanks!

I also profiled the CUPTI sample callback_timestamp (cuda-11.7-x86_64/opt/cuda/extras/CUPTI/samples/callback_timestamp). It shows about 45 MB of memory overhead from CUPTI for a single CUDA context once cuptiSubscribe is called, bringing total memory usage to 53 MB. A memory usage flame graph is below.

In comparison, when CUPTI is not enabled (cuptiSubscribe is not called), the total memory usage of callback_timestamp is about 8 MB, as shown below.

Hi Ming,

Do you observe high memory overhead only when cuptiSubscribe is called after CUDA is initialized and CUDA module(s) are loaded? The CUPTI sample callback_timestamp calls cuptiSubscribe before CUDA initialization (cuInit), so its memory overhead is low, but it may increase at a later point when CUDA modules are loaded.

I wonder if the high memory allocation is due to CUPTI storing data related to the CUDA modules loaded by the application, i.e. the cubin images, which can be large in size. CUPTI frees this memory when the module is unloaded.

Hi Mjain,

Thanks for your reply. It is exactly as you said. There are two cases:

  1. When cuptiSubscribe is called after the CUDA context is initialized, the memory allocation is observed during cuptiSubscribe.
  2. When cuptiSubscribe is called before the CUDA context is initialized, the memory allocation is observed later, usually when CUDA modules are loaded.

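In code, the two cases look roughly like this (a sketch against the driver API; error handling omitted, and the comments mark where the heap profile shows the allocations):

```c
#include <cuda.h>
#include <cupti.h>

static void CUPTIAPI cb(void *ud, CUpti_CallbackDomain d,
                        CUpti_CallbackId id, const void *data) {
    (void)ud; (void)d; (void)id; (void)data;
}

/* Case 1: subscribe after CUDA is initialized. */
void case1_subscribe_after_init(void) {
    CUpti_SubscriberHandle s;
    cuInit(0);
    /* ... context creation and module loads happen here ... */
    cuptiSubscribe(&s, (CUpti_CallbackFunc)cb, NULL);  /* allocation observed here */
}

/* Case 2: subscribe before CUDA is initialized. */
void case2_subscribe_before_init(void) {
    CUpti_SubscriberHandle s;
    cuptiSubscribe(&s, (CUpti_CallbackFunc)cb, NULL);  /* cheap at this point */
    cuInit(0);
    /* allocation observed later, when CUDA modules are loaded
       (e.g. cuModuleLoad or the first kernel launch) */
}
```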
Could you follow up on the lead that CUPTI is storing additional data for CUDA modules? This memory overhead is significant when we have many CUDA modules.
Is there any way to reduce it?

Thanks in advance.


Thanks, Ming, for confirming that the memory overhead is due to CUPTI storing additional data for CUDA modules. Currently there is no way to reduce it. We plan to investigate whether the memory footprint can be reduced in an upcoming release of CUPTI.

And thanks for reporting this issue.