I’d like to understand how cudaEvent records work and how that affects the timing of kernel launches.
Consider the following code, which times a kernel call from within a larger program:

    cudaEventRecord(start);
    kernel_1<<<...,...>>>(...);
    cpu_func();                  // Is this recorded?
    cudaEventRecord(stop);
    [...]
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&result, start, stop);
My guess is that cudaEventRecord merely specifies the order of events, i.e. "start" -> "kernel_1" -> "stop", where start and stop are "virtual" events defined relative to the actual execution of kernel_1 on the GPU. If that is the case, cudaEventElapsedTime would not include the time taken by cpu_func, since cpu_func runs on the host. Please correct me if I’m wrong. (Also, I’m currently not in a position to check this myself.)
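In case it helps, here is a compilable sketch of the pattern I’m asking about. The kernel body, the launch configuration, and cpu_func are placeholders I made up; only the event calls matter:

```cuda
#include <cstdio>
#include <unistd.h>

// Placeholder standing in for kernel_1 in the snippet above.
__global__ void kernel_1(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

// Placeholder host-side work standing in for cpu_func.
void cpu_func() {
    usleep(1000);
}

int main() {
    int *d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);          // enqueued on the default stream
    kernel_1<<<1, 32>>>(d_out);      // kernel launch is asynchronous
    cpu_func();                      // runs on the host in the meantime
    cudaEventRecord(stop);           // enqueued after the kernel

    cudaEventSynchronize(stop);      // block host until stop has completed
    float result = 0.0f;
    cudaEventElapsedTime(&result, start, stop);  // milliseconds
    printf("elapsed: %f ms\n", result);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```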
Now consider the same snippet without the call to cpu_func:

    cudaEventRecord(start);
    kernel_1<<<...,...>>>(...);
    cudaEventRecord(stop);
    [...]
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&result, start, stop);
Would nvprof’s summary statistics give a more accurate elapsed time (assuming kernel_1 is called only once) than computing it with cudaEvents and cudaEventElapsedTime?