How do CUPTI Event Instances and Domains relate to hardware architecture?

mrwolf · September 6, 2021, 2:08pm

Hi everyone.
I have a Jetson AGX Xavier with a 512-core Volta GPU.
I developed a CUPTI-based profiler to study the behavior of the GPU under specific workloads; for each benchmark, it profiles all events exposed by the GPU.

However, I am having a hard time correlating CUPTI entities with the actual hardware architecture of the GPU.
The CUPTI documentation talks about events, domains, instances and groups. As a reference for the events and domains exposed by my GPU, see the following sheet, which I generated with the CUPTI API.

Volta GPU Event counters.xlsx (8.1 KB)

I have a few questions about this:

1. Is there any hardware relation among the events grouped under the same domain? What do event domains represent from the point of view of the hardware architecture?
2. Domains containing events related to computational units (i.e. domains A and D) have 8 instances; with the deviceQuery CUDA sample I found that my Volta GPU has 512 CUDA cores divided among 8 multiprocessors. Does the number of instances available for an event reflect the number of multiprocessors in the GPU (i.e. one event instance per multiprocessor)? Does this mean that only one streaming processor is tracked per multiprocessor, and that it is supposed to represent the whole multiprocessor under the assumption of a well-balanced load? In other words, how do event instances map to the hardware architecture?
3. L2-cache-related events (i.e. domain E) have only 4 instances. Additionally, they seem to be split into events for “slice 0 of L2” and events for “slice 1 of L2”. Can you help me clarify how these L2-related events map to the GPU's actual hardware?
4. For events from domains A and D, all 8 instances of the same event seem to be strongly correlated for any given benchmark (the load is balanced over all instances). The events from domain E (4 instances) behave differently: the first 2 instances of each event (instances 0 and 1) show some activity, while the other 2 (instances 2 and 3) always read a constant 0, regardless of the benchmark. Do you have any hardware explanation for this? Is the L2 cache organized in a way that reflects this event/instance layout? Is some portion of the cache disabled by default, so that only 2 event instances actually profile something?
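
To make the last observation concrete, here is a minimal Python sketch of the kind of per-instance check I mean. It is not my actual profiler: the function name `summarize_instances`, the event names `sm_event` and `l2_event`, and all counter values are made-up placeholders, chosen only to mimic the pattern described above (8 balanced instances versus 4 instances of which two always read 0).

```python
from statistics import mean, pstdev

def summarize_instances(samples):
    """samples maps an event name to a list of runs; each run is a list
    holding one counter value per event instance."""
    report = {}
    for event, runs in samples.items():
        n_instances = len(runs[0])
        # An instance counts as "dead" if it reads 0 in every run.
        dead = [i for i in range(n_instances)
                if all(run[i] == 0 for run in runs)]
        live = [i for i in range(n_instances) if i not in dead]
        # Coefficient of variation across live instances, averaged over runs:
        # a small value means the load is well balanced across instances.
        cvs = []
        for run in runs:
            vals = [run[i] for i in live]
            m = mean(vals) if vals else 0
            if m > 0:
                cvs.append(pstdev(vals) / m)
        report[event] = {
            "dead_instances": dead,
            "mean_cv": mean(cvs) if cvs else None,
        }
    return report

# Hypothetical data: two benchmark runs per event, values are invented.
samples = {
    "sm_event": [  # stands in for a domain A/D event with 8 instances
        [100, 102, 98, 100, 101, 99, 100, 100],
        [200, 198, 202, 201, 199, 200, 200, 200],
    ],
    "l2_event": [  # stands in for a domain E event with 4 instances
        [50, 52, 0, 0],
        [10, 12, 0, 0],
    ],
}

report = summarize_instances(samples)
for event, info in report.items():
    print(event, info)
```

With data shaped like this, the 8-instance event reports no dead instances and a low spread, while the 4-instance event reports instances 2 and 3 as always zero, which is the behavior I am asking about.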

Thanks for your interest if you have read this far, and I hope for an interesting discussion :)
