nvprof collects events for each kernel in isolation, i.e. it serializes the kernels in the application so that events can be attributed to a specific kernel. This helps the user understand and analyse the optimization opportunities for each kernel separately. If the specified events/metrics can’t be collected in a single run of the application, nvprof by default replays each kernel multiple times until all the events/metrics are collected. The --replay-mode option can be used to change the replay mode. In “application replay” mode, nvprof re-runs the whole application, instead of replaying each kernel, in order to collect all events/metrics.
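As a rough illustration of how the replay mode is selected, the following Python sketch only assembles an nvprof command line and prints it; it does not run the profiler. The helper name nvprof_replay_command, the application path ./my_app and the event/metric names are placeholders chosen for this example (available names depend on the device and can be listed with --query-events and --query-metrics). The three mode values reflect my reading of the nvprof documentation and should be checked against your nvprof version.

```python
import shlex

# Values accepted by --replay-mode, as understood from the nvprof documentation
# (an assumption to verify with `nvprof --help` on your installation).
REPLAY_MODES = ("disabled", "kernel", "application")


def nvprof_replay_command(app, events=(), metrics=(), replay_mode="kernel"):
    """Build (but do not run) an nvprof argument list.

    `app` is the application command as a list, e.g. ["./my_app", "arg1"].
    This helper is hypothetical and exists only for illustration.
    """
    if replay_mode not in REPLAY_MODES:
        raise ValueError(f"unknown replay mode: {replay_mode!r}")
    cmd = ["nvprof", "--replay-mode", replay_mode]
    if events:
        cmd += ["--events", ",".join(events)]
    if metrics:
        cmd += ["--metrics", ",".join(metrics)]
    return cmd + list(app)


# Request application replay: the whole application is re-run as many
# times as needed instead of replaying each kernel.
cmd = nvprof_replay_command(
    ["./my_app"],
    events=["warps_launched"],               # placeholder event name
    metrics=["ipc", "achieved_occupancy"],   # placeholder metric names
    replay_mode="application",
)
print(shlex.join(cmd))
```

Application replay is typically worth considering when an application's kernels touch a lot of device memory, since kernel replay has to save and restore that state for every pass; it does assume the application behaves the same way on every run.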
When collecting events/metrics, nvprof profiles all kernels launched on all visible CUDA devices by default. The profiling scope can be limited to a specific context, stream, kernel or kernel invocation. More details about profiling scope can be found at http://docs.nvidia.com/cuda/profiler-users-guide/index.html#profiling-scope
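To make the idea of a limited scope concrete, here is a small Python sketch that assembles a scoped nvprof command line without running it. It assumes the --devices option and the --kernels filter syntax context:stream:kernel:invocation as described in the profiling-scope documentation linked above, where an empty field is understood to match anything and scope options apply to the --events/--metrics options that follow them. The helper names, the application path ./my_app, the stream name foo and the kernel name bar are hypothetical.

```python
import shlex


def kernel_filter(kernel, context="", stream="", invocation=""):
    """Format a --kernels filter as context:stream:kernel:invocation.

    Empty fields are left blank, which is assumed to match any value.
    """
    return f"{context}:{stream}:{kernel}:{invocation}"


def nvprof_scoped_command(app, metrics, devices=(), kernels=None):
    """Build (but do not run) an nvprof argument list with a limited scope.

    Scope options are placed before --metrics so that they apply to it.
    """
    cmd = ["nvprof"]
    if devices:
        cmd += ["--devices", ",".join(str(d) for d in devices)]
    if kernels:
        cmd += ["--kernels", kernels]
    cmd += ["--metrics", ",".join(metrics)]
    return cmd + list(app)


# Collect the placeholder metric "ipc" only on device 0, for the 2nd
# invocation of kernels matching "bar" on context 1 and stream "foo".
cmd = nvprof_scoped_command(
    ["./my_app"],
    metrics=["ipc"],
    devices=[0],
    kernels=kernel_filter("bar", context="1", stream="foo", invocation="2"),
)
print(shlex.join(cmd))
```

Narrowing the scope this way reduces profiling overhead, because only the selected kernel invocations need to be serialized and replayed.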