I need to compare the execution times of two kernels (kernel time only). One kernel is implemented in CUDA, while the other is generated through an LLVM and NVRTC pipeline, so I cannot insert a cudaEvent right before the second kernel starts executing. Instead, I measure both kernels with nvprof and extract the execution times from the profiler's output.
My question is whether this way of benchmarking is correct: does nvprof report only the kernel execution time? (For example, I do not do any warm-up kernel invocations, and I compute the average running time over multiple separate nvprof runs rather than by invoking the kernels in a loop.)
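To make the averaging step concrete, here is a minimal sketch of how the per-run numbers could be summarized once they have been read off the profiler output. It assumes one kernel duration per nvprof run has already been collected into a list; the function name `summarize_runs` and all the numbers are hypothetical placeholders, not real measurements.

```python
import statistics


def summarize_runs(durations_us):
    """Return (mean, sample standard deviation) of per-run kernel durations.

    durations_us: one kernel duration per profiler run, in microseconds.
    """
    if len(durations_us) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return statistics.mean(durations_us), statistics.stdev(durations_us)


# Hypothetical durations (microseconds), one value per separate nvprof run.
cuda_kernel_us = [100.0, 102.0, 98.0]
nvrtc_kernel_us = [121.0, 119.0, 120.0]

for name, runs in [("cuda", cuda_kernel_us), ("nvrtc", nvrtc_kernel_us)]:
    mean, stdev = summarize_runs(runs)
    print(f"{name}: mean={mean:.2f} us, stdev={stdev:.2f} us")
```

Reporting the standard deviation next to the mean shows whether the difference between the two kernels is larger than the run-to-run noise.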