CUPTI memory overheads


I’m looking into CUPTI memory overhead and have noticed that memory usage increases when CUPTI is enabled (I use only the callback API). The increase also appears to scale with the number of CPU cores.

Part of the memory reservation happens during cuptiSubscribe().

Is there a way to avoid this memory reservation, or to tell CUPTI how many cores the profiled process actually runs on? My process runs on only a subset of the CPUs, so allocating buffers for every CPU core would be unnecessary.
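To make "a subset of CPUs" concrete, here is a minimal Python sketch (Linux only) that compares the CPUs the current process is allowed to run on with the number of logical CPUs in the machine. The helper name cpu_view is just for illustration. Whether CUPTI sizes anything by either number is not established here; this only shows the two values that could differ.

```python
import os


def cpu_view():
    """Return (allowed_cpus, total_cpus) for the current process.

    allowed_cpus: sorted list of CPU ids in the affinity mask (Linux only).
    total_cpus:   logical CPU count reported by the OS, or None if unknown.
    """
    allowed = sorted(os.sched_getaffinity(0))  # 0 means "this process"
    total = os.cpu_count()
    return allowed, total


if __name__ == "__main__":
    allowed, total = cpu_view()
    print(f"allowed to run on {len(allowed)} CPU(s): {allowed}")
    print(f"logical CPUs in the machine: {total}")
```

Running it under something like taskset -c 0-3 should report four allowed CPUs while the machine total stays unchanged.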


Hi Ming,

CUPTI's memory footprint should not scale with the number of CPU cores. Does your application allocate resources (such as CUDA contexts or loaded CUDA modules) based on the number of CPU cores? Based on your experiments, how much memory does CUPTI appear to use per CPU core?
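One possible way to put a number on the per-core question is to record the process's resident memory with and without CUPTI on systems with different core counts, then fit a line to the difference. The Python sketch below does only the arithmetic; the helper name per_core_overhead and the sample layout are assumptions for illustration, not part of CUPTI.

```python
def per_core_overhead(samples):
    """Least-squares fit of CUPTI overhead against CPU core count.

    samples: iterable of (cores, rss_with_cupti, rss_without_cupti),
             with both memory values in the same unit (for example MiB).
    Returns (slope, intercept): overhead per core and fixed overhead.
    """
    # Overhead at each core count is the difference between the two runs.
    pts = [(float(c), float(w) - float(wo)) for c, w, wo in samples]
    n = len(pts)
    if n < 2:
        raise ValueError("need measurements for at least two core counts")

    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("core counts must differ between samples")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A slope close to zero would suggest the overhead does not depend on core count, while a clearly positive slope would be worth reporting along with the raw measurements. If the application itself allocates per-core resources, those show up in both runs and cancel out in the difference.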

For tracing, CUPTI's memory overhead is documented in the CUPTI documentation (CUPTI :: CUPTI Documentation). Note that this section is specific to device memory usage: CUPTI allocates certain resources in device memory and/or pinned host memory for each CUDA context.