In Kernel Replay, all metrics requested for a specific kernel instance in NVIDIA Nsight Compute are grouped into one or more passes. For the first pass, all GPU memory that can be accessed by the kernel is saved. After the first pass, the subset of memory that is written by the kernel is determined.
(quoted from here)
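To make the save/restore step concrete, here is a minimal, self-contained CUDA sketch of my own (the kernel name `accumulate` is just an illustration, nothing from the documentation): the kernel overwrites its own input, so replaying the launch without restoring the written memory would make every pass operate on different data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void accumulate(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;  // in-place update: the result depends on the buffer's prior contents
    }
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launching twice on unrestored memory produces different buffer contents
    // each time; kernel replay's save/restore step exists to prevent exactly this.
    accumulate<<<(n + 255) / 256, 256>>>(d_data, n);
    accumulate<<<(n + 255) / 256, 256>>>(d_data, n);

    float h = 0.0f;
    cudaMemcpy(&h, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] after two unrestored launches: %f\n", h);  // 3.0 instead of 1.0

    cudaFree(d_data);
    return 0;
}
```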
It looks like this assumes that a kernel's memory accesses are deterministic across passes. But why?
In my understanding, kernel execution is parallel, so its memory accesses are not necessarily deterministic. Or is there something I'm missing?
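Here is a small example of the kind of non-determinism I have in mind (my own illustration; the kernel name `claim_slots` is made up): the value written to each output slot depends on the order in which threads happen to execute the atomic, and if writes were additionally conditional on such race outcomes, even the set of touched addresses could change between runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void claim_slots(int* counter, int* slot_owner, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // Each thread atomically claims the next free output slot. Which thread
        // ends up owning which slot depends on the hardware scheduling order,
        // so the values written to slot_owner can differ from run to run.
        int slot = atomicAdd(counter, 1);
        slot_owner[slot] = tid;
    }
}

int main() {
    const int n = 1024;
    int *d_counter = nullptr, *d_owner = nullptr;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMalloc(&d_owner, n * sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    claim_slots<<<(n + 255) / 256, 256>>>(d_counter, d_owner, n);

    int first = -1;
    cudaMemcpy(&first, d_owner, sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread that claimed slot 0: %d\n", first);  // may vary between runs

    cudaFree(d_counter);
    cudaFree(d_owner);
    return 0;
}
```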
UPDATE
For correctly identifying and combining performance counters collected from multiple application replay passes of a single kernel launch into one result, the application needs to be deterministic with respect to its kernel activities and their assignment to GPUs, contexts, streams, and potentially NVTX ranges. Normally, this also implies that the application needs to be deterministic with respect to its overall execution.
So it does indeed make this assumption. But what happens if the assumption does not hold? And how often does it hold in practice?
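For example (again my own sketch, not something from the documentation), an application structured like the one below would violate the assumption for application replay: the number of kernel launches depends on a floating-point reduction computed with `atomicAdd`, whose result varies with summation order, so two replay passes could observe different launch sequences.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

__global__ void noisy_sum(const float* x, float* sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(sum, x[i]);  // floating-point result depends on addition order
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_x(n, 1e-7f);  // many tiny values, so rounding order matters
    float *d_x = nullptr, *d_sum = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    float total = 0.0f;
    int launches = 0;
    // Host-side control flow depends on a device value that is computed
    // non-deterministically, so the number of kernel launches (and hence the
    // launch sequence the profiler sees) can differ between replay passes.
    do {
        cudaMemset(d_sum, 0, sizeof(float));
        noisy_sum<<<(n + 255) / 256, 256>>>(d_x, d_sum, n);
        cudaMemcpy(&total, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
        ++launches;
    } while (total < 0.10485f && launches < 10);  // data-dependent stopping condition

    printf("kernel launches this run: %d (may differ between runs)\n", launches);

    cudaFree(d_x);
    cudaFree(d_sum);
    return 0;
}
```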