Can nvprof report summed-up results for a metric/event, rather than only the min/max/avg per kernel?

While using nvprof, I’d like to get summed-up results (events and metrics) across multiple launches and across different kernels.

For example, given code like this:

test.cpp

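#include <cuda_profiler_api.h>  // declares cudaProfilerStart() / cudaProfilerStop()

cudaProfilerStart();   // begin the profiled region
while (…) {
    some_cuda_kernel_A<<<…>>>(…);
    some_cuda_kernel_B<<<…>>>(…);
}
cudaProfilerStop();    // end the profiled region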

I would like to get the ‘dram_read_bytes’ metric summed over all executions of all kernels in the profiled region.

Does anybody know the fastest way to do this, or should I write a wrapper around the CUPTI API?


It seems that non-Tesla GPUs don’t support this, so I wrote a wrapper.
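
For anyone who would rather not write a wrapper, an approximate total can also be recovered from the per-kernel summary that nvprof already prints: that summary lists the number of invocations and the average metric value for each kernel, so multiplying the two and adding the products over all kernels gives a sum. Below is a minimal sketch of that arithmetic. It is not the wrapper mentioned above, and the MetricSummaryRow struct and total_metric function are hypothetical names made up for illustration; filling the rows from nvprof's output is left out.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One row of nvprof's per-kernel metric summary.
// Hypothetical helper type: the field names are illustrative only and are
// not part of nvprof or CUPTI.
struct MetricSummaryRow {
    std::string kernel;         // kernel name as printed by nvprof
    std::uint64_t invocations;  // value of the "Invocations" column
    double avg;                 // value of the "Avg" column for the metric
};

// Reconstruct the total across all launches of all kernels.
// "Avg" is the mean per launch, so invocations * avg is the per-kernel sum.
double total_metric(const std::vector<MetricSummaryRow>& rows) {
    double total = 0.0;
    for (const auto& row : rows) {
        total += static_cast<double>(row.invocations) * row.avg;
    }
    return total;
}
```

With the loop above, the two rows would be some_cuda_kernel_A and some_cuda_kernel_B, each with its own invocation count and average. The result is only as precise as the averages nvprof prints; if an exact figure matters, collecting the metric with --print-gpu-trace, which reports one value per launch, and summing those values directly should avoid that rounding.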