During CUDA context creation, CUPTI allocates a single device buffer of size CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE. The default is 8 MB, which can hold tracing information for roughly 0.25 M kernels in concurrent kernel mode. The attribute is configurable, so you can choose any value that suits your requirements. CUPTI does not allocate more buffers unless they are needed: once the device buffer is exhausted, it allocates another device buffer of the same size. Note that the memory footprint does not scale with the kernel count, because CUPTI reuses a buffer after processing all the records in it.
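As a rough illustration of the sizing arithmetic, here is a minimal sketch. It assumes that 8 MB means 8 * 1024 * 1024 bytes and that capacity scales linearly with buffer size; both are inferences from the approximate figures above, not documented guarantees. The helper names and the 1 MiB rounding step are hypothetical, chosen only for this example.

```python
# Back-of-the-envelope sizing from the figures quoted above.
# ASSUMPTIONS (not from the CUPTI docs): "8 MB" is 8 MiB, and the number of
# kernel records a device buffer can hold scales linearly with its size.

DEFAULT_BUFFER_BYTES = 8 * 1024 * 1024   # default CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE
KERNELS_PER_DEFAULT_BUFFER = 250_000     # "~0.25 M kernels", an approximate figure


def kernels_per_buffer(buffer_bytes):
    """Estimate how many kernel records fit in a buffer of the given size."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return buffer_bytes * KERNELS_PER_DEFAULT_BUFFER // DEFAULT_BUFFER_BYTES


def buffer_bytes_for(kernel_count, align=1024 * 1024):
    """Estimate a buffer size that holds kernel_count records.

    The result is rounded up to a multiple of `align`; the 1 MiB default is a
    convenience picked for this sketch, not a CUPTI requirement.
    """
    # Ceiling division: bytes needed at the inferred per-kernel footprint.
    raw = -(-kernel_count * DEFAULT_BUFFER_BYTES // KERNELS_PER_DEFAULT_BUFFER)
    # Round up to the next multiple of `align`.
    return -(-raw // align) * align


if __name__ == "__main__":
    print(kernels_per_buffer(DEFAULT_BUFFER_BYTES))
    print(buffer_bytes_for(1_000_000))
```

The value returned by buffer_bytes_for is only an estimate of what you might pass for CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE. Since CUPTI reuses buffers and allocates additional ones on demand, an undersized estimate should cost extra allocations rather than cause a hard failure.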
In general, activity buffer flushing should be independent of the device buffer size. Due to an optimization, however, it currently has some dependency on CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE, though it is independent of the pool limit CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT. This behavior will be improved in a future CUDA release, in which we will decouple flushing from the device buffer size so that activity buffers are delivered as soon as they are ready to be consumed.
Refer to the Memory Overhead section of the CUPTI guide: https://docs.nvidia.com/cupti/Cupti/r_main.html#unique_1148016283