Hi!
While profiling PyTorch kernels, I ran into some discrepancies between the times reported by Nsight Compute and the times I measure in PyTorch with CUDA events.
I followed this example (admittedly written for Nsight Systems; I swapped in Nsight Compute), which does something of the form:
nb_iters = 20
warmup_iters = 10
for i in range(nb_iters):
    optimizer.zero_grad()
    # start profiling after 10 warmup iterations
    if i == warmup_iters: torch.cuda.cudart().cudaProfilerStart()
    # push range for current iteration
    if i >= warmup_iters: torch.cuda.nvtx.range_push("iteration{}".format(i))
    # push range for forward
    if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
    output = model(data)
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()
    ...
torch.cuda.cudart().cudaProfilerStop()
and followed this example to measure model execution time using CUDA events, with code of the form:
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    # elapsed_time() returns milliseconds, so this returns seconds
    return result, start.elapsed_time(end) / 1000
...
eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")
print("~" * 10)
Notably, for certain kernels, the latter method gives me a runtime of around 0.1 ms = 100 us, whereas NCU reports a runtime of roughly 5-10 us. With that in mind, I have a few questions:
Firstly, do I have to call torch.cuda.synchronize() after calling nvtx.range_pop()? There's an unanswered comment under the first link asking the same question.
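For concreteness, the variant I'm asking about would look roughly like this (the explicit synchronize is the hypothetical addition, so that the popped range actually covers the GPU work rather than just the kernel launches):

if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
output = model(data)
# hypothetical: wait for the forward kernels to finish before closing the range
torch.cuda.synchronize()
if i >= warmup_iters: torch.cuda.nvtx.range_pop()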
If not: I know that there are discrepancies between NCU and NSys/PyTorch's CUPTI-based timings, which I believe is explained here. If my code is correct, is that what I'm seeing in these profiling discrepancies?
If so (i.e., if NCU doesn't give the measurement I want), should I opt for the PyTorch profiler or NSys?
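In case it helps frame that last question: if the PyTorch profiler is the right tool here, I assume the per-kernel CUDA times I'm after would come from something like the following (a minimal sketch, reusing the same model and inp as above):

from torch.profiler import profile, ProfilerActivity

# record CPU ops and the CUDA kernels they launch for one forward pass
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(inp)

# per-kernel CUDA times, aggregated across calls
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))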