Nsight Compute vs. Nsight Systems vs. PyTorch Profiler

Hi!

While profiling PyTorch kernels, I ran into some discrepancies between the times reported by Nsight Compute and the PyTorch profiler.

I followed this example, admittedly swapping Nsight Systems for Nsight Compute. The example does something of the form:

nb_iters = 20
warmup_iters = 10
for i in range(nb_iters):
    optimizer.zero_grad()

    # start profiling after 10 warmup iterations
    if i == warmup_iters: torch.cuda.cudart().cudaProfilerStart()

    # push range for current iteration
    if i >= warmup_iters: torch.cuda.nvtx.range_push("iteration{}".format(i))

    # push range for forward
    if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
    output = model(data)
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()

    ...

torch.cuda.cudart().cudaProfilerStop()

I also followed this example to measure model execution time using CUDA events, with code of the form:

def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000
...

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")

print("~" * 10)
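
For comparison, here is a minimal sketch of a variant of the `timed` helper above that records one pair of CUDA events around many back-to-back calls, so that per-call launch overhead is amortized instead of being charged to a single measurement. The name `timed_amortized`, the `warmup` and `iters` defaults, and the CPU fallback via `time.perf_counter` are assumptions made for illustration and are not part of the linked example. Even with this change the result is a per-call wall time on the GPU stream, not a pure kernel duration.

```python
import time

import torch


def timed_amortized(fn, warmup=10, iters=100):
    """Return the mean time per call of fn, in seconds, over `iters` calls."""
    # Warm up so one-time costs (allocations, autotuning) are not measured.
    for _ in range(warmup):
        fn()

    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # drain pending work before timing
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait until `end` has actually been reached
        # elapsed_time returns milliseconds; convert to seconds per call.
        return start.elapsed_time(end) / 1000 / iters

    # Assumed fallback for machines without CUDA: plain host-side timing.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters
```

With the names from the snippet above, this could be called as `timed_amortized(lambda: model(inp))` inside the `torch.no_grad()` block.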

Notably, for certain kernels, the latter method gives me a runtime of around 0.1 ms (100 us), whereas NCU reports a runtime of roughly 5-10 us. To that end, I have a few questions:

Firstly, do I have to call synchronize after calling nvtx.range_pop()? There’s an unanswered comment in the first link asking the same question.

If not, I know that there are discrepancies between NCU and the CUPTI-based timings from NSys and the PyTorch profiler. I believe that this is explained here. If my code is correct, would this explain the profiling discrepancies I’m seeing?

If so (i.e. NCU doesn’t give the measurement I want), should I opt for the PyTorch profiler or NSys?

Hi, @azhaoy22

Thanks for using our developer tools. As our dev replied in "I get different time in ncu and pytorch prolifer - #2 by felix_dt":
For the sake of measuring pure kernel runtime, it is recommended to rely on CUPTI or Nsight Systems. For measuring kernel-level performance metrics, it is recommended to rely on Nsight Compute.
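
As a rough illustration of the CUPTI route, the sketch below uses `torch.profiler`, which collects GPU kernel timings through CUPTI when CUDA is available. The helper name `profile_op`, the warmup and iteration counts, and the CPU-only fallback are assumptions for illustration and do not come from the linked thread.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_op(fn, warmup=10, iters=20):
    """Profile `iters` calls of fn and return the averaged per-op statistics."""
    use_cuda = torch.cuda.is_available()
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        # GPU kernel activity is only recorded when CUDA is present.
        activities.append(ProfilerActivity.CUDA)

    # Warm up outside the profiled region.
    for _ in range(warmup):
        fn()
    if use_cuda:
        torch.cuda.synchronize()

    with profile(activities=activities) as prof:
        for _ in range(iters):
            fn()
        if use_cuda:
            # Make sure all launched kernels finish inside the profiled region.
            torch.cuda.synchronize()

    return prof.key_averages()
```

Something like `print(profile_op(lambda: model(inp)).table(row_limit=10))` then prints a per-operator and per-kernel summary, whose kernel durations should be closer to what Nsight Systems reports than an event-timed single call.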
