Profiler speed

Hi
I did a test to see why profiling is slow in one application and faster in another. When I use one of the following metrics:

l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum
l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum

Each kernel needs one pass. However, if I request them together, each kernel needs 3 passes. I also see that the 3 passes are slow in one application but faster in another. What I want to know is:

  1. How can I get more information about the passes? I know that is part of the tool's internals, but intuitively, counting load instructions shouldn't need multiple passes. When the profiler encounters a load instruction, it increments a counter. Isn't that correct?

  2. Why is counting load instructions slow in one application and fast in another? In my test, the slow application is ssd-mobilenet running inside Docker and the fast application is something running in a host terminal. Both use cuDNN functions. I understand that a different number of instructions in the kernel can be one cause, but I am not sure it is the only or most important one. For example, profiling a kernel with 100K load instructions will be slower than profiling a kernel with 1K load instructions. But is that the only reason?

Any thoughts on that?
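
For context, the behavior can be reproduced with invocations along these lines (a sketch, with ./my_app standing in for the profiled application, not the exact command I used):

ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum ./my_app
ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum ./my_app

The first form needs one pass per kernel; the second form is where I see 3 passes.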

Each of these metrics needs only a single pass by itself, but they can't be collected in the same pass. Therefore, if both are requested, at least two passes are needed.

I know that is part of the tool's internals, but intuitively, counting load instructions shouldn't need multiple passes. When the profiler encounters a load instruction, it increments a counter. Isn't that correct?

Your intuition mostly applies to SASS metrics, which are collected using SW patching. These can be identified by having "sass" in their name, or by being part of the Source Metrics in the Metrics Reference (there are two general types of SASS metrics). While all SASS metrics of each group can be collected in the same pass, the same does not apply to HW counters like l1tex__t_sectors_*.

The tool's logic is that if the requested metrics require only a single pass in total, they are collected as-is, to avoid kernel replay and memory save/restore. If the requested metrics require more than one pass (so that workload replay is needed anyway), an additional pass is enabled to collect the list of executed functions. This is the third pass you are encountering. It is currently not possible to disable this behavior for runs with HW metrics. You can disable it using the appropriate environment variable when collecting only instruction-level SASS metrics, in which case the number of replays is reduced from two to one.

Why is counting load instructions slow in one application and fast in another?

Besides the size of the measured kernels, the size of the involved CUDA libraries can also be a factor (probably not in your case, if both are using the same library), as different numbers of functions need to be patched. The amount of memory to be saved and restored for each pass is also relevant. See the overhead section of the documentation. You could check if using application replay, which does not require memory save/restore, can improve your results.
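
As a sketch, application replay can be selected on the ncu command line like this (./my_app again standing in for the profiled application):

ncu --replay-mode application --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum ./my_app

With application replay, the entire application is re-run for each pass instead of saving and restoring memory around each kernel, so the application must execute deterministically for the results to be valid.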

They are not using the same cuDNN kernels. I did an additional test with the NVBit opcode tool and saw that ssd-mobilenet has more LD* instructions. For example:

LDG.E.128=219488
LDGDEPBAR=750880
LDGSTS.E.BYPASS.LTC128B.128=3119040
LDS=1039680
LDSM.16.M88.4=3927680

vs.

LDC.U8=16384
LDG.E.CONSTANT=785408
LDS.128=1572864

in the faster application. This is just for LD* instructions; the same applies to integer and other instruction types.

I was wondering whether the variety of opcodes, and the speed of profiling different opcode types, e.g. LDGSTS vs. LDS, are also sources of this slowdown.
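
In case it helps, the counts above were gathered roughly like this, assuming the opcode histogram tool from the NVBit samples (the tool path and name are placeholders and may differ in your NVBit checkout; ./my_app is the profiled application):

LD_PRELOAD=./tools/opcode_hist/opcode_hist.so ./my_app

NVBit instruments the SASS at load time and prints per-opcode instruction counts, which is where the numbers above come from.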

The number of LD and LDGSTS instructions can impact the Nsight Compute pass that Felix referenced, which is used to save time on the save/restore, as the overhead of that pass depends heavily on the number of memory instructions.

Collecting HW counters does not impact the kernel time.

You can compare the amount of data that needs to be saved and restored with the metrics:

  • profiler__replayer_bytes_mem_backed_up.avg
  • profiler__replayer_bytes_mem_backed_up.sum
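
For example, a sketch of adding them to the existing request (./my_app is a placeholder for the profiled application):

ncu --metrics profiler__replayer_bytes_mem_backed_up.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum ./my_app

Comparing the backed-up bytes between the two applications should show how much of the per-pass overhead comes from memory save/restore.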
