Hi!
While profiling PyTorch kernels, I ran into some discrepancies between the times reported by Nsight Compute and the times I measure in PyTorch with CUDA events.
I followed this example (admittedly written for Nsight Systems; I swapped in Nsight Compute), which does something of the form:
nb_iters = 20
warmup_iters = 10
for i in range(nb_iters):
    optimizer.zero_grad()
    # start profiling after 10 warmup iterations
    if i == warmup_iters: torch.cuda.cudart().cudaProfilerStart()
    # push range for current iteration
    if i >= warmup_iters: torch.cuda.nvtx.range_push("iteration{}".format(i))
    # push range for forward
    if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
    output = model(data)
    if i >= warmup_iters: torch.cuda.nvtx.range_pop()
    ...
torch.cuda.cudart().cudaProfilerStop()
and followed this example to measure model execution time using CUDA events, with code of the form:
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    # elapsed_time() returns milliseconds, so this returns seconds
    return result, start.elapsed_time(end) / 1000
...
eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")
print("~" * 10)
Notably, for certain kernels, the latter method gives me a runtime of around 0.1 ms = 100 us, whereas NCU reports a runtime of roughly 5-10 us. With that in mind, I have a few questions:
Firstly, do I have to call torch.cuda.synchronize() after calling nvtx.range_pop()? There's an unanswered comment under the first link asking the same question.
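For concreteness, the variant I'm asking about would look roughly like this (the explicit synchronize is the hypothetical addition, so that the popped range actually covers the GPU work rather than just the kernel launches):

if i >= warmup_iters: torch.cuda.nvtx.range_push("forward")
output = model(data)
# hypothetical: wait for the forward kernels to finish before closing the range
torch.cuda.synchronize()
if i >= warmup_iters: torch.cuda.nvtx.range_pop()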
If not: I know that there are discrepancies between NCU and NSys/PyTorch's CUPTI-based timings, which I believe is explained here. If my code is correct, is that what I'm seeing in these profiling discrepancies?
If so (i.e., if NCU doesn't give the measurement I want), should I opt for the PyTorch profiler or NSys?
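In case it helps frame that last question: if the PyTorch profiler is the right tool here, I assume the per-kernel CUDA times I'm after would come from something like the following (a minimal sketch, reusing the same model and inp as above):

from torch.profiler import profile, ProfilerActivity

# record CPU ops and the CUDA kernels they launch for one forward pass
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(inp)

# per-kernel CUDA times, aggregated across calls
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))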