My CUDA version is 12.1 and my Nsight Compute version is 2023.1.0.0. So far I have only found ways to profile specific kernels within specific invocations. I was wondering how to profile the whole execution without collecting detailed profiling information for each kernel, which should be faster and use less disk space.
Nsight Compute is designed to profile individual instances of running kernels to identify performance issues within them. Due to limitations in how metrics are collected, this usually requires saving and restoring the state from before the kernel was launched so that the kernel can be replayed multiple times. For this reason, among others, profiling the entire application and aggregating the data is not a directly supported feature.
There is the option to use Range Replay, which can aggregate data for multiple kernels in a range. In theory, if your application were small enough to store all the state changes needed for replay, you might be able to create a range around the entire thing, but that is not explicitly what Range Replay was designed for.
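As a rough sketch of what that could look like, the snippet below wraps the region of interest in the CUDA runtime profiler start/stop markers (cudaProfilerStart/cudaProfilerStop), which is one of the ways ranges can be defined for Range Replay. The kernel and launch configuration are just placeholders, and whether the whole workload fits Range Replay's restrictions depends on your application. You would then profile with something like `ncu --replay-mode range -o report ./app`.

```cpp
// Minimal sketch: mark one range around the whole workload so Nsight Compute's
// Range Replay mode can aggregate metrics across the kernels inside it.
// Build with: nvcc -o app range_example.cu
#include <cuda_profiler_api.h>  // cudaProfilerStart / cudaProfilerStop
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Everything launched between these two markers is treated as one range.
    cudaProfilerStart();
    for (int iter = 0; iter < 10; ++iter) {
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }
    cudaDeviceSynchronize();
    cudaProfilerStop();

    cudaFree(d_data);
    std::printf("done\n");
    return 0;
}
```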
The results are also available from the CLI and in other export formats, so you could do some manual aggregation at the end using scripts, spreadsheets, etc.
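For instance, you could export the per-kernel results to CSV with something like `ncu --import report.ncu-rep --csv --page raw > metrics.csv` and then sum a metric of interest across all kernels. The helper below is only a sketch under stated assumptions: the report name, the metric column name, and the simplistic CSV handling are all illustrative, and real ncu CSV output quotes its fields, so a spreadsheet or a proper CSV parser may be the more practical route.

```cpp
// Hypothetical post-processing sketch: sum one metric column across all rows
// (kernels) of a CSV exported from an ncu report.
// Example usage (metric name is illustrative): ./aggregate metrics.csv dram__bytes.sum
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line on commas (no quoted-field handling; sketch only).
static std::vector<std::string> splitCsv(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    return fields;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cerr << "usage: aggregate <metrics.csv> <column name>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::string header;
    std::getline(in, header);
    std::vector<std::string> cols = splitCsv(header);

    // Find the requested metric column in the header row.
    size_t idx = 0;
    for (; idx < cols.size(); ++idx)
        if (cols[idx] == argv[2]) break;
    if (idx == cols.size()) {
        std::cerr << "column not found\n";
        return 1;
    }

    // Accumulate that column over every kernel row.
    double total = 0.0;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> fields = splitCsv(line);
        if (idx < fields.size()) total += std::atof(fields[idx].c_str());
    }
    std::cout << argv[2] << " summed over all kernels: " << total << "\n";
    return 0;
}
```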