To profile a program's performance on a CPU, I can use the perf tool [perf][Perf Wiki] to collect CPU micro-events such as page-faults and branch-misses. Intel CPUs provide 4-6 counter registers in the PMU (performance monitoring unit) for counting these events.
Question:
I want to profile my CUDA programs with nvprof. There are 141 events in nvprof, such as l1_local_load_hit and l1_local_load_miss [nvprof][Profiler :: CUDA Toolkit Documentation]. How many counter registers does the PMU (performance monitoring unit) of an NVIDIA GPU provide? My GPUs are a K80 and a P100. Thanks!
I want to know how many events I should collect at a time.
With the Intel CPU profiler, for example, the PMU provides 4 counter registers for counting events, so it is advisable to collect 4 events at a time.
There is no fixed number of events that can be profiled in a single pass. It depends on the combination of events and metrics requested.
You can use the `cuptiEventGroupSetsCreate` API in CUPTI to find the number of passes required for a given combination of events: pass it the list of event IDs, and the `numSets` field of the returned `CUpti_EventGroupSets` is the number of passes.
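As a rough illustration of that call, here is a minimal host-side sketch that asks CUPTI how many passes a set of events needs. It assumes device 0, uses the two event names from the question as defaults (they may not exist on every GPU, in which case the name lookup reports an error), and assumes a CUDA toolkit whose CUPTI still ships the event API. The `CHECK_CU` and `CHECK_CUPTI` macros are helpers made up for this example, and the build line in the comment uses typical install paths that may differ on your system.

```cuda
// Sketch: query the number of passes CUPTI needs for a set of events.
// Build (paths are assumptions, adjust to your install):
//   nvcc passes.cu -o passes -I$CUDA_HOME/extras/CUPTI/include \
//        -L$CUDA_HOME/extras/CUPTI/lib64 -lcupti -lcuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda.h>
#include <cupti.h>

// Hypothetical helper: abort with a message if a driver API call fails.
#define CHECK_CU(call)                                              \
    do {                                                            \
        CUresult r_ = (call);                                       \
        if (r_ != CUDA_SUCCESS) {                                   \
            const char *s_ = nullptr;                               \
            cuGetErrorString(r_, &s_);                              \
            fprintf(stderr, "%s failed: %s\n", #call,               \
                    s_ ? s_ : "unknown error");                     \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical helper: abort with a message if a CUPTI call fails.
#define CHECK_CUPTI(call)                                           \
    do {                                                            \
        CUptiResult r_ = (call);                                    \
        if (r_ != CUPTI_SUCCESS) {                                  \
            const char *s_ = nullptr;                               \
            cuptiGetResultString(r_, &s_);                          \
            fprintf(stderr, "%s failed: %s\n", #call,               \
                    s_ ? s_ : "unknown error");                     \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main(int argc, char **argv) {
    // Default to the event names from the question; override via arguments.
    std::vector<const char *> names = {"l1_local_load_hit",
                                       "l1_local_load_miss"};
    if (argc > 1) names.assign(argv + 1, argv + argc);

    CUdevice dev;
    CUcontext ctx;
    CHECK_CU(cuInit(0));
    CHECK_CU(cuDeviceGet(&dev, 0));                 // assumption: device 0
    CHECK_CU(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK_CU(cuCtxSetCurrent(ctx));

    // Translate event names into the device-specific event IDs.
    std::vector<CUpti_EventID> ids(names.size());
    for (size_t i = 0; i < names.size(); ++i)
        CHECK_CUPTI(cuptiEventGetIdFromName(dev, names[i], &ids[i]));

    // Ask CUPTI to partition the events into sets; one set is one pass.
    CUpti_EventGroupSets *passes = nullptr;
    CHECK_CUPTI(cuptiEventGroupSetsCreate(
        ctx, ids.size() * sizeof(CUpti_EventID), ids.data(), &passes));

    printf("%zu event(s) need %u pass(es) on device 0\n",
           names.size(), passes->numSets);

    CHECK_CUPTI(cuptiEventGroupSetsDestroy(passes));
    CHECK_CU(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

Running it with different event lists (for example `./passes l1_local_load_hit l1_local_load_miss`) shows how the pass count changes with the combination rather than with a fixed register count. For metrics, CUPTI offers what I believe is the analogous call, `cuptiMetricCreateEventGroupSets`; check the CUPTI documentation for your toolkit version before relying on it.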