How to get exact measurement of CPU and GPU running time?

For the example code below, which calculates a dot product on both the CPU and the GPU, how do I get the correct execution time for the CPU and GPU parts separately? I have looked into nvprof, ncu, and CUDA events. Which one should be the go-to?

#include <stdio.h>
#include <stdlib.h>

#define N (2048 * 9999) // number of elements
#define M 512  // threads per block

void fill(int *a) {
    for(int i = 0; i < N; ++i) {
        a[i] = 1;
    }
}

void cpu_dot_product(int *a, int *b, int *c) {
    int sum = 0;
    for(int i = 0; i < N; ++i) {
        sum += a[i] * b[i];
    }
    *c = sum;
}

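__global__ void gpu_dot_product(int *a, int *b, int *c) {
    __shared__ int temp[M];
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    // Out-of-range threads store 0 so every slot of temp[] is initialized.
    temp[threadIdx.x] = (index < N) ? a[index] * b[index] : 0;

    // All products must be written before thread 0 reads them.
    __syncthreads();

    if(threadIdx.x == 0) {
        int sum = 0;
        for(int i = 0; i < M; ++i) {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
}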

int main() {
    int *a, *b, *c;
    int *da, *db, *dc;
    int size = N * sizeof(int);

    a = (int *) malloc(size);
    b = (int *) malloc(size);
    c = (int *) malloc(sizeof(int));

    // Initialize the inputs; malloc() leaves them uninitialized.
    fill(a);
    fill(b);

    cpu_dot_product(a, b, c);
    printf("%d ", *c);
    *c = 0;

    cudaMalloc((void **) &da, size);
    cudaMalloc((void **) &db, size);
    cudaMalloc((void **) &dc, sizeof(int));

    cudaMemcpy(da, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, size, cudaMemcpyHostToDevice);
    // Zero the device accumulator before the kernel adds into it.
    cudaMemset(dc, 0, sizeof(int));

    gpu_dot_product<<<((N + M - 1) / M), M>>>(da, db, dc);

    cudaMemcpy(c, dc, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d ", *c);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    free(a);
    free(b);
    free(c);

    return 0;
}

Unless you have other requirements, I would suggest using a profiler to time CUDA kernel duration (other methods can be made to work for various purposes). For older devices (pre-Volta) that would be nvprof. For newer devices, either ncu or nsys can give you this info.

If you want to measure the duration of something that is purely host code (such as your cpu_dot_product function), that topic is not unique or specific to CUDA, and you can find recommendations for it on any number of forums.

If you search my posts here on this forum or on SO, you will find many examples of me using host-based timing for both host and device activity; this is often sufficient for the demonstration purposes I have, and doesn’t unnecessarily burden the discussion with the complexity of using the profiler. Here is an example of me using nsys; here is an example of me using host-based timing.
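For the purely host-side case, one common approach is a wall-clock timer around the call. The following is only a minimal sketch, assuming C++11 or later and `std::chrono::steady_clock`; the helpers `time_ms` and `report_cpu_time` are hypothetical names introduced for illustration, and the dot product is rewritten on `std::vector` for brevity.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Same CPU dot product as in the question, rewritten on std::vector.
int cpu_dot_product(const std::vector<int> &a, const std::vector<int> &b) {
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Hypothetical helper: runs f() once and returns the elapsed
// wall-clock time in milliseconds.
template <typename F>
double time_ms(F &&f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical demo: times the dot product of two all-ones vectors of
// length n, prints the measurement, and returns the dot product.
int report_cpu_time(std::size_t n) {
    std::vector<int> a(n, 1), b(n, 1);
    int result = 0;
    double ms = time_ms([&] { result = cpu_dot_product(a, b); });
    std::printf("cpu result = %d, time = %.3f ms\n", result, ms);
    return result;
}
```

Calling `report_cpu_time(2048 * 9999)` from `main` would time the same problem size as the question. A single run can be noisy, so repeating the measurement a few times and comparing the results is usually worthwhile.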

The forums are littered with questions discussing these tradeoffs in some detail.

Most of my work is on Linux. The answer isn’t hugely different on Windows. However, there will be some differences in detail, such as the exact syntax and method you might use for host-based timing. You can find examples where njuffa has provided a handy routine that works for either Linux or Windows. Here is an example.

When doing host-based timing of device activity, it’s usually good practice not to try and get super complicated with streams, threads, and whatnot (KISS principle), and generally good practice to include a cudaDeviceSynchronize() before the start of your timing region, and another cudaDeviceSynchronize() before you close the timing region. This helps to make sure that other GPU asynchronous activity doesn’t pollute the things you are trying to observe.
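The synchronize-before, synchronize-after pattern might look like the sketch below. Kernel launches return to the host immediately, so without the second cudaDeviceSynchronize() the timer would mostly capture launch overhead. This is an illustration under assumptions rather than a definitive implementation: the `fill_ones` kernel is a hypothetical helper added to keep the example short, `std::chrono` stands in for whichever host timer you prefer, and error checking is omitted.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

#define N (2048 * 9999) // number of elements, as in the question
#define M 512           // threads per block

// Dot-product kernel from the question, with the block-level sync in place.
__global__ void gpu_dot_product(const int *a, const int *b, int *c) {
    __shared__ int temp[M];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = (index < N) ? a[index] * b[index] : 0;
    __syncthreads();
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < M; ++i) sum += temp[i];
        atomicAdd(c, sum);
    }
}

// Hypothetical helper: fills a device array with 1 so no host buffers are needed.
__global__ void fill_ones(int *a) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N) a[index] = 1;
}

int main() {
    int *da, *db, *dc;
    cudaMalloc((void **)&da, N * sizeof(int));
    cudaMalloc((void **)&db, N * sizeof(int));
    cudaMalloc((void **)&dc, sizeof(int));

    const int blocks = (N + M - 1) / M;
    fill_ones<<<blocks, M>>>(da);
    fill_ones<<<blocks, M>>>(db);
    cudaMemset(dc, 0, sizeof(int));

    // Drain all previously issued GPU work so it is not charged to the kernel.
    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();

    gpu_dot_product<<<blocks, M>>>(da, db, dc);

    // Wait for the kernel to finish before stopping the host timer.
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();

    int c = 0;
    cudaMemcpy(&c, dc, sizeof(int), cudaMemcpyDeviceToHost);
    printf("gpu result = %d, kernel time = %.3f ms\n", c, ms);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return 0;
}
```

The timed region here covers only the kernel. Moving the host-to-device and device-to-host copies inside the two synchronize calls would measure the end-to-end GPU path instead, which is often the fairer comparison against the CPU version.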

If you need timing of complex activity, I strongly suggest the use of the profiler. Use the visual timeline to be sure you are looking at the things you are intending to look at.
