Benchmarking CUDA applications: how to minimize source code modifications?

Hi, I’ve been wondering what the best ways are to benchmark (larger) CUDA applications while modifying the source code as little as possible.

Ideally, I’d like to get the execution time of the whole application, each CUDA kernel, each CPU function, the combined execution time of concurrent kernels, etc., all in one .csv file.

Is there a way to include the execution time of functions executed on the CPU in the command line profiler's output file?
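
To make the CPU-side part of the question concrete, here is a minimal sketch of the kind of instrumentation I have in mind: a scoped timer that appends one `name,milliseconds` row to a CSV stream when it goes out of scope, so each CPU function needs only one added line. The names `ScopedCsvTimer`, `sum_to` and `timed_sum_row` are hypothetical and only for illustration; none of them come from the CUDA toolkit or the profiler, and the CSV layout is an assumption, not the profiler's format.

```cpp
#include <chrono>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper (not part of any CUDA tool): times the enclosing scope
// on the host and writes one "name,milliseconds" CSV row when the scope ends.
class ScopedCsvTimer {
public:
    ScopedCsvTimer(std::string name, std::ostream& out)
        : name_(std::move(name)), out_(out),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedCsvTimer() {
        const auto end = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> elapsed = end - start_;
        out_ << name_ << ',' << elapsed.count() << '\n';
    }

    // Non-copyable: each timer should emit exactly one row.
    ScopedCsvTimer(const ScopedCsvTimer&) = delete;
    ScopedCsvTimer& operator=(const ScopedCsvTimer&) = delete;

private:
    std::string name_;
    std::ostream& out_;
    std::chrono::steady_clock::time_point start_;
};

// Stand-in for a CPU function of the application (hypothetical).
inline long sum_to(long n) {
    long total = 0;
    for (long i = 1; i <= n; ++i) total += i;
    return total;
}

// Times one call to sum_to and returns the CSV row that was produced.
inline std::string timed_sum_row(long n, long& result) {
    std::ostringstream csv;
    {
        ScopedCsvTimer timer("sum_to", csv);  // the only line added per function
        result = sum_to(n);
    }  // timer is destroyed here and writes its row
    return csv.str();
}
```

In a real application the stream would be a `std::ofstream` opened once at startup, and the rows would still have to be merged with the profiler's kernel timings afterwards. Note that this only measures host code: kernel launches are asynchronous, so a host timer placed around a launch would mostly capture launch overhead unless the scope also ends with a synchronization such as `cudaDeviceSynchronize()`.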