[DL Prof] Accumulating latency for the ancestors in the stack trace of an op

In dlprofviewer, under the ‘Ops And Kernels’ tab, we get a lot of useful metrics for each op, along with a stack trace that narrows down the line of Python source code that called it (I am profiling model inference in PyTorch). However, I would also like the metrics rolled up as follows.
Suppose the Python source code has two functions, foo and bar, where foo calls bar like so:

def foo():
    bar()

DLProf shows the stack trace as /foo/bar along with the metric for the op that was called. In this case, I want that metric to be accumulated for foo() as well. Similarly, if the stack trace is /foo/bar/baz, I want the metric shown for baz to be added to bar and foo, the metric shown for bar to be added to foo, and so on up the stack.
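To make the roll-up concrete, here is a minimal sketch of the accumulation I have in mind. It assumes each op can be obtained as a (stack trace, metric) pair with frames separated by "/"; that input format and the function name accumulate_inclusive are hypothetical and not part of DLProf.

```python
from collections import defaultdict


def accumulate_inclusive(records):
    """Add each op's metric to its own frame and to every ancestor frame.

    `records` is an iterable of (stack_trace, metric) pairs, e.g.
    ("/foo/bar/baz", 3.0). This input format is an assumption made for
    illustration, not DLProf's actual export format.
    """
    totals = defaultdict(float)
    for stack_trace, metric in records:
        # Split "/foo/bar/baz" into ["foo", "bar", "baz"], dropping empty parts.
        frames = [frame for frame in stack_trace.split("/") if frame]
        # Credit the metric to every prefix: /foo, /foo/bar, /foo/bar/baz.
        for depth in range(1, len(frames) + 1):
            prefix = "/" + "/".join(frames[:depth])
            totals[prefix] += metric
    return dict(totals)


# Made-up latencies, purely for illustration.
records = [
    ("/foo/bar/baz", 3.0),
    ("/foo/bar", 2.0),
    ("/foo", 1.0),
]
totals = accumulate_inclusive(records)
for path in sorted(totals):
    print(path, totals[path])
```

With these made-up numbers, /foo/bar/baz keeps its own 3.0, /foo/bar becomes 2.0 + 3.0, and /foo becomes 1.0 + 2.0 + 3.0, which is the inclusive total I would like to see per frame.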
Is this functionality implemented in DLProf, or in some other NVIDIA developer tool? Or do we need to write this logic ourselves? Any advice would be helpful.