While using nvprof, I'd like to get results (events and metrics) summed across multiple, different kernels.
For example, take code like this:
test.cpp
…
cudaProfilerStart();
while (…) {
…
some_cuda_kernel_A<<<…>>>
some_cuda_kernel_B<<<…>>>
…
}
cudaProfilerStop();
…
I'd like to get the 'dram_read_bytes' metric summed over all executions of all the kernels between cudaProfilerStart() and cudaProfilerStop().
Does anybody know the fastest way to do this, or should I write a wrapper around the CUPTI API?
–
It seems that non-Tesla GPUs don't support this, so I wrote a wrapper.
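The wrapper itself isn't shown in this post. For anyone attempting the same thing, here is a minimal sketch of just the summing part. It assumes the per-launch metric values have already been read by some other means (for example through CUPTI), and that step is not shown. `MetricAccumulator`, `add`, `total` and `totalForKernel` are hypothetical names made up for illustration; they are not part of nvprof or CUPTI.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper (not an nvprof or CUPTI type): keeps running totals of
// per-launch metric values, both overall and broken down by kernel name.
class MetricAccumulator {
public:
    // Record one value of `metric` measured for a single launch of `kernel`.
    void add(const std::string& kernel, const std::string& metric,
             std::uint64_t value) {
        perKernel_[metric][kernel] += value;
        total_[metric] += value;
    }

    // Sum of `metric` over every recorded launch of every kernel.
    // Returns 0 if the metric was never recorded.
    std::uint64_t total(const std::string& metric) const {
        auto it = total_.find(metric);
        return it == total_.end() ? 0 : it->second;
    }

    // Sum of `metric` over every recorded launch of one kernel.
    // Returns 0 if the metric or the kernel was never recorded.
    std::uint64_t totalForKernel(const std::string& metric,
                                 const std::string& kernel) const {
        auto m = perKernel_.find(metric);
        if (m == perKernel_.end()) return 0;
        auto k = m->second.find(kernel);
        return k == m->second.end() ? 0 : k->second;
    }

private:
    std::map<std::string, std::map<std::string, std::uint64_t>> perKernel_;
    std::map<std::string, std::uint64_t> total_;
};
```

In the loop from the question, one would call something like acc.add("some_cuda_kernel_A", "dram_read_bytes", value) after each launch has been measured, then read acc.total("dram_read_bytes") once after cudaProfilerStop().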