Why doesn't the instrumented execution time match the execution time captured by nsys?

Forgive me for asking what may seem like a silly question.

When I measure time using instrumentation, I only get a few milliseconds. Here’s my measurement method:

import torch

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
# module forward pass goes here
end_event.record()

torch.cuda.synchronize()

elapsed_time = start_event.elapsed_time(end_event)
print(f"total time: {elapsed_time} ms")

However, when I look at the timeline captured by nsys, I can see that there’s a 1-second gap just between two batch norm kernel calls.

Is this because the event timing only records the sum of kernel execution times and doesn’t include the idle periods between them? Or is there an issue with my measurement method?

What I actually want to obtain is the entire execution time (not including initial and final data transfers).

Additionally, could I use the time elapsed between CUDA profiling initialization and the CUDA profiling data flush as my total time (including the host-to-device and device-to-host data transfers)?

I usually suggest that people with PyTorch questions ask on a PyTorch forum such as discuss.pytorch.org. NVIDIA experts patrol that forum.
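
In the meantime, one way to cross-check the event-based number is a host-side wall-clock measurement bracketed by synchronization calls. Below is a minimal sketch of that idea, not a definitive method. The names timed_region, fn, and sync are hypothetical and exist only for illustration. sync defaults to a no-op so the snippet runs without a GPU; with PyTorch you would pass torch.cuda.synchronize, which is a real PyTorch call.

```python
import time


def timed_region(fn, sync=lambda: None):
    """Return (result, elapsed_ms) for fn(), measured with a host clock.

    sync is called before starting the clock, so work queued earlier is not
    counted, and again after fn(), so asynchronously launched work has
    finished before the clock is read.
    """
    sync()                                  # drain work queued before the region
    start = time.perf_counter()
    result = fn()                           # hypothetical workload, e.g. module(x)
    sync()                                  # wait for everything fn() launched
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return result, elapsed_ms


# Assumed usage with PyTorch (module and x are placeholders, already on the GPU):
#   import torch
#   out, ms = timed_region(lambda: module(x), sync=torch.cuda.synchronize)
#   print(f"total time: {ms:.3f} ms")
```

Because this reads a host clock, the result includes launch overhead and any idle gaps between kernels inside the region. Keeping the initial and final data transfers outside fn leaves them out of the total.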