How do CUPTI Event Instances and Domains relate to hardware architecture?

Hi everyone.
I have a Jetson AGX Xavier with a 512-core Volta GPU.
I developed a CUPTI-based profiler to study the behavior of the GPU under specific workloads; for each benchmark, it profiles all events exposed by the GPU.

However, I am having a hard time correlating CUPTI entities with the actual hardware architecture of the GPU.
The CUPTI documentation talks about events, domains, instances, and groups. As a reference for the events and domains exposed by my GPU, see the following sheet, which I generated with the CUPTI API.

Volta GPU Event counters.xlsx (8.1 KB)

I have a few questions about this:

  • is there any hardware relation among the events grouped under the same domain? What do event domains represent, from the point of view of hardware architecture?
  • domains containing events related to computational units (i.e. domains A and D) have 8 instances; with the deviceQuery CUDA sample I found that my Volta GPU has 512 CUDA cores divided into 8 multiprocessors (64 cores each).
    Does the number of instances available for an event reflect the number of multiprocessors in the GPU (i.e. one event instance per multiprocessor)? Does this mean that only one streaming processor is tracked per multiprocessor, and that it is supposed to represent the whole multiprocessor under the assumption of a well-balanced load?
    In other words, how do event instances map to the hardware architecture?
  • L2-cache-related events (i.e. domain E) have only 4 instances. Additionally, they seem to be split into events for “slice 0 of L2” and events for “slice 1 of L2”. Can you help me understand how these L2-related events map to the actual GPU hardware?
  • for events from domains A and D, all 8 instances of the same event are strongly correlated for any given benchmark (the load is balanced over all instances); however, the events from domain E (4 instances) behave differently: the first 2 instances of each event (instances 0 and 1) show some activity, while the other 2 (instances 2 and 3) always read a constant 0, independently of the benchmark. Is there a hardware explanation for this? Is the L2 cache organized in a way that reflects this event/instance layout? Is some portion of the cache disabled by default, so that only 2 event instances are actually profiling something?
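
To make the last point concrete, here is a minimal Python sketch of the kind of per-instance check I mean. The counter values below are made up for illustration (they are not my measurements), and `summarize_instances` is just a hypothetical helper name. The normalization step follows the formula given in the CUPTI documentation: sum of the per-instance values, multiplied by the domain's total instance count, divided by the number of instances actually profiled.

```python
def summarize_instances(values, total_instance_count=None):
    """Summarize the per-instance values of one CUPTI event.

    values: list of counts, one per profiled domain instance
            (hypothetical data; in a real profiler these would come
            from reading an event group with all instances enabled).
    total_instance_count: total instances of the domain on the device;
            assumed equal to len(values) when not given.
    """
    instance_count = len(values)
    if total_instance_count is None:
        total_instance_count = instance_count

    total = sum(values)
    mean = total / instance_count

    # Largest relative deviation from the mean; 0.0 means perfectly balanced.
    spread = max(abs(v - mean) for v in values) / mean if mean else 0.0

    return {
        "active": [i for i, v in enumerate(values) if v != 0],
        "idle": [i for i, v in enumerate(values) if v == 0],
        # CUPTI-style normalization: (sum * totalInstanceCount) / instanceCount
        "normalized": total * total_instance_count / instance_count,
        "spread": spread,
    }


# Illustrative numbers only: 8 balanced instances (as in domains A and D)
# versus 4 instances where the last two stay at zero (as in domain E).
sm_like = summarize_instances([100, 100, 100, 100, 100, 100, 100, 100])
l2_like = summarize_instances([10, 12, 0, 0])

print(sm_like["idle"], sm_like["spread"])
print(l2_like["active"], l2_like["idle"])
```

With this kind of summary, the domain A and D events come out with no idle instances and a small spread, while every domain E event reports instances 2 and 3 as idle.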

Thanks for your interest if you kept reading this far, and I hope for an interesting discussion :)