Question about Nsight Tools Timestamping Mechanism and Clock Sources

I have a question about how Nsight performance profiling handles timestamping for events in the timeline. Specifically, I am curious about the clock sources that Nsight uses to generate the timestamps shown in the timeline.

Does Nsight utilize high-resolution system clocks, such as the Time Stamp Counter (TSC) or other clock sources, to timestamp events in the timeline? If so, could you provide more details on how Nsight synchronizes events across multiple GPUs, and how it manages the timing and synchronization of events within the timeline?

I’ve gone through the official documentation, but I haven’t found explicit information on whether Nsight reads from these clock sources directly.

I would appreciate any clarification or additional resources that explain how Nsight handles timestamping and time synchronization between events, especially across multiple GPUs.

Thanks in advance!

CUPTI and the Nsight tools use high-precision OS timers such as the TSC (via RDTSC, std::chrono, …) for CPU timestamps. On each platform we try to use the highest-precision timer with the lowest overhead. On mobile/embedded platforms this includes working with vendors to get user-mode access to the timer.
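As a rough illustration of the std::chrono route mentioned above, here is a minimal sketch of reading a monotonic CPU timestamp in nanoseconds. The helper name cpu_now_ns is hypothetical, and the choice of steady_clock is an assumption made for the example; it is not a statement of which clock CUPTI or Nsight actually selects on a given platform.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of CUPTI or Nsight): returns the current
// CPU time in nanoseconds from a monotonic clock. Which hardware counter
// backs steady_clock (for example the TSC) depends on the platform and
// the standard library implementation.
inline std::int64_t cpu_now_ns() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               steady_clock::now().time_since_epoch())
        .count();
}
```

Two successive calls never go backwards, because steady_clock is monotonic, which is the property a trace tool needs when ordering CPU events.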

The GPU timestamps are based on an internal nanosecond clock with a precision of approximately 32 ns.

  • CUDA kernels can read it via inline PTX (%globaltimer) or the newer CUDA C++ std::chrono support.
  • Graphics shaders can read it via various extensions.

On pre-GH100 GPUs the GPU internal clock only updated at 1 MHz (1 µs). The tools increase the update rate to 31.25 MHz (32 ns), but only while the tool is running. On GH100+ the SM %globaltimer updates at 31.25 MHz (32 ns).
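To make the relationship between update rate and resolution concrete, here is a small sketch of the arithmetic. It models a nanosecond counter that only refreshes at a fixed rate; the function names are hypothetical, and this is only an illustration of the numbers quoted above, not a description of how the hardware is implemented.

```cpp
#include <cstdint>

// Period in nanoseconds of a counter that updates at update_hz.
// Assumes update_hz divides 1e9 evenly, which holds for both rates
// discussed here (1 MHz and 31.25 MHz).
constexpr std::int64_t tick_period_ns(std::int64_t update_hz) {
    return 1'000'000'000 / update_hz;
}

// Value a reader would observe if the true time is t_ns but the counter
// only refreshes every period_ns: the time rounded down to the last update.
constexpr std::int64_t observed_ns(std::int64_t t_ns, std::int64_t period_ns) {
    return (t_ns / period_ns) * period_ns;
}

static_assert(tick_period_ns(31'250'000) == 32, "31.25 MHz -> 32 ns");
static_assert(tick_period_ns(1'000'000) == 1000, "1 MHz -> 1000 ns");
```

For example, under this model a true time of 1234 ns would be observed as 1216 ns at 31.25 MHz but as 1000 ns at 1 MHz, which is why short kernels are hard to time with the slower update rate.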

Other GPU subsystems, such as the hardware performance monitor and the underlying timestamping commands, use the same clock and output timestamps with 32 ns resolution.

The trace tools must regularly perform CPU ↔ GPU time synchronization because the CPU and GPU clocks drift relative to each other. The tools use linear interpolation to convert GPU timestamps to CPU time. For multiple GPUs, each GPU's timestamps are correlated to the CPU clock. For multi-node captures, tools such as NSYS may also capture a network timestamp or allow an offset to be provided when importing the report.
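For anyone wondering what the linear interpolation step looks like in principle, here is a minimal sketch. It assumes two synchronization points, each pairing a GPU timestamp with the CPU timestamp sampled at the same moment. The SyncPoint struct and gpu_to_cpu_ns function are hypothetical names for illustration; the actual scheme used inside CUPTI and the Nsight tools is not spelled out in this thread.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical pairing of a GPU timestamp with the CPU timestamp taken
// at (approximately) the same instant during a synchronization step.
struct SyncPoint {
    std::int64_t gpu_ns;
    std::int64_t cpu_ns;
};

// Map a GPU timestamp onto the CPU timeline by interpolating linearly
// between two sync points. Assumes a.gpu_ns != b.gpu_ns.
inline std::int64_t gpu_to_cpu_ns(std::int64_t gpu_ns,
                                  const SyncPoint& a,
                                  const SyncPoint& b) {
    const double cpu_span = static_cast<double>(b.cpu_ns - a.cpu_ns);
    const double gpu_span = static_cast<double>(b.gpu_ns - a.gpu_ns);
    const double offset   = static_cast<double>(gpu_ns - a.gpu_ns);
    // Scaling by cpu_span / gpu_span absorbs drift between the two clocks.
    return a.cpu_ns + std::llround(cpu_span * offset / gpu_span);
}
```

With sync points {gpu 1000, cpu 5000} and {gpu 2000, cpu 6010}, a GPU timestamp of 1500 maps to CPU time 5505: the 10 ns of drift accumulated over the interval is spread evenly across it. With several GPUs, each device would get its own pair of sync points against the same CPU clock.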