How to get an exact measurement of CPU and GPU running time?

For the example code below, which computes a dot product on both the CPU and the GPU, how do I measure the execution time of the CPU part and the GPU part separately? I have looked into nvprof, ncu, and CUDA events. Which one should be the go-to?

#include <stdio.h>
#include <stdlib.h>

#define N (2048 * 9999) // number of elements
#define M 512  // threads per block

void fill(int *a) {
    for(int i = 0; i < N; ++i) {
        a[i] = 1;
    }
}

void cpu_dot_product(int *a, int *b, int *c) {
    int sum = 0;
    for(int i = 0; i < N; ++i) {
        sum += a[i] * b[i];
    }
    *c = sum;      
}

__global__ void gpu_dot_product(int *a, int *b, int *c) {
    __shared__ int temp[M];
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    if(index < N) {
        temp[threadIdx.x] = a[index] * b[index];
    } else {
        temp[threadIdx.x] = 0;  // avoid summing uninitialized shared memory in a partial last block
    }
    __syncthreads();

    if(threadIdx.x == 0) {
        int sum = 0;
        for(int i = 0; i < M; ++i) {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
    
}

int main() {
    int *a, *b, *c;
    int *da, *db, *dc;
    int size = N * sizeof(int);


    a = (int *) malloc(size);
    fill(a);
    b = (int *) malloc(size);
    fill(b);
    c = (int *) malloc(sizeof(int));

    cpu_dot_product(a, b, c);
    printf("%d ", *c);
    printf("\n"); 
    *c = 0;


    cudaMalloc((void **) &da, size);
    cudaMalloc((void **) &db, size);
    cudaMalloc((void **) &dc, sizeof(int));

    cudaMemcpy(da, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, size, cudaMemcpyHostToDevice);
    cudaMemset(dc, 0, sizeof(int)); // zero the device accumulator before the kernel's atomicAdd
    

    gpu_dot_product<<<((N + M - 1) / M), M>>>(da, db, dc);

    cudaMemcpy(c, dc, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    printf("%d ", *c);
    printf("\n"); 

    free(a);
    free(b);
    free(c);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);

    return 0;
}

If you don’t have any other needs, I would always suggest using a profiler for timing CUDA kernel duration (other methods can be made to work for various purposes). For older devices (pre-Volta) that would be nvprof. For newer devices, either ncu or nsys can give you this information.

If you want to measure the duration of something that is purely host code (such as your cpu_dot_product function), that topic is not unique or specific to CUDA; you can find recommendations for it on any number of forums. If you search my posts here on this forum or on SO, you will find many examples of me using host-based timing for both host and device activity; this is often sufficient for the demonstration purposes I have, and it doesn’t unnecessarily burden the discussion with the complexity of using the profiler. Here is an example of me using nsys, and here is an example of me using host-based timing.
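Since the question also mentions CUDA events: if you do want to use them, a minimal sketch of event-based timing of just the kernel launch could look like the following. It reuses da, db and dc from the posted code, omits error checking, and reports elapsed time in milliseconds.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
gpu_dot_product<<<((N + M - 1) / M), M>>>(da, db, dc);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until the stop event (and the kernel) has completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in milliseconds
printf("GPU kernel time (events): %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);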

The forums are littered with questions discussing these tradeoffs in some detail.

Most of my work is on Linux. The answer isn’t hugely different on Windows, but there will be some detail differences, such as the exact syntax and method you might use for host-based timing. You can find examples where njuffa has provided a handy routine that works on either Linux or Windows. Here is an example.
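That routine isn’t reproduced here, but as a rough, illustrative sketch (not njuffa’s code), a portable wall-clock helper can also be built on std::chrono, which behaves the same on Linux and Windows:

#include <chrono>

// Illustrative portable wall-clock timer: elapsed seconds since the first call.
static double wall_seconds() {
    using clock = std::chrono::steady_clock;
    static const clock::time_point t0 = clock::now();
    return std::chrono::duration<double>(clock::now() - t0).count();
}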

When doing host-based timing of device activity, it’s usually good practice not to try and get super complicated with streams, threads, and whatnot (KISS principle), and generally good practice to include a cudaDeviceSynchronize() before the start of your timing region, and another cudaDeviceSynchronize() before you close the timing region. This helps to make sure that other GPU asynchronous activity doesn’t pollute the things you are trying to observe.
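For illustration, here is a minimal sketch of that pattern applied to the kernel in the posted code, using std::chrono for the host timer. The variable names come from the question, #include <chrono> is assumed, and error checking is omitted.

cudaDeviceSynchronize();                    // drain any previously issued GPU work
auto t0 = std::chrono::steady_clock::now(); // open the timing region

gpu_dot_product<<<((N + M - 1) / M), M>>>(da, db, dc);

cudaDeviceSynchronize();                    // make sure the kernel has actually finished
auto t1 = std::chrono::steady_clock::now(); // close the timing region

double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
printf("GPU kernel time (host-measured): %f ms\n", ms);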

If you need timing of complex activity, I strongly suggest the use of the profiler. Use the visual timeline to be sure you are looking at the things you are intending to look at.
