Semantics of recording a cudaEvent | Accuracy of cudaEvents vs. nvprof

I’d like to understand how cudaEvents are recorded and how that affects the way kernel launches are timed.

Consider the following code, which times a kernel call from a larger program:

cudaEventRecord(start);
kernel_1<<<...,...>>>(...);
cpu_func(); // Is the time spent here included in the measurement?
cudaEventRecord(stop);
[...]
cudaEventSynchronize(stop);
cudaEventElapsedTime(&result, start, stop);

I’m guessing that calls to cudaEventRecord merely specify the order of events, i.e. “start” -> “kernel” -> “stop”, where start and stop are “virtual” events defined relative to the actual event of running the “kernel_1” kernel. If so, cudaEventElapsedTime would not include the time taken by cpu_func. Please correct me if I’m wrong. (I’m currently not in a position to check this myself.)
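For anyone who can run it, the guess above could be checked with a small experiment along the following lines. This is only a sketch under stated assumptions: busy_kernel and the sleeping cpu_func are hypothetical stand-ins for the real kernel_1 and cpu_func, the cycle count and the 50 ms delay are arbitrary, everything runs on the default stream, and error checking is omitted for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel_1: spins on the GPU for roughly `cycles` clock ticks.
__global__ void busy_kernel(long long cycles) {
    long long begin = clock64();
    while (clock64() - begin < cycles) { }
}

// Hypothetical stand-in for cpu_func: blocks the host thread for `ms` milliseconds.
static void cpu_func(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}

// Returns the event-measured time in milliseconds for the sequence
// record(start) -> kernel launch -> cpu_func -> record(stop).
static float time_with_events(long long gpu_cycles, int cpu_ms) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);              // enqueued on the default stream
    busy_kernel<<<1, 1>>>(gpu_cycles);   // asynchronous launch; control returns to the host
    cpu_func(cpu_ms);                    // host-side work between the launch and the stop event
    cudaEventRecord(stop);               // enqueued only after cpu_func has returned

    cudaEventSynchronize(stop);          // block the host until the stop event has completed
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    time_with_events(1000, 0);            // warm-up run to absorb one-time setup costs

    const long long cycles = 10000000LL;  // arbitrary amount of GPU busy-work
    printf("without cpu_func  : %.3f ms\n", time_with_events(cycles, 0));
    printf("with 50 ms cpu_func: %.3f ms\n", time_with_events(cycles, 50));
    return 0;
}
```

If the two printed times come out about the same, the events are capturing only GPU-side work. If the second is noticeably larger, the delay in enqueuing the stop event is showing up in the measurement, and cpu_func should be moved outside the start/stop pair.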

Now consider the same snippet without the call to cpu_func:

cudaEventRecord(start);
kernel_1<<<...,...>>>(...);
cudaEventRecord(stop);
[...]
cudaEventSynchronize(stop);
cudaEventElapsedTime(&result, start, stop);

Would nvprof’s summary stats provide a more accurate value of the time elapsed (assuming I’m calling kernel_1 only once) compared to calculating it using cudaEvents and cudaEventElapsedTime?