Inconsistent CUDA Kernel Execution Times in Sequential Execution

Hi all,

I’ve encountered an issue with inconsistent execution times when running a simple CUDA kernel sequentially in a single process. Specifically, I’m performing a basic vector addition (A + B = C). The first execution takes about 10ms, but the second execution reports almost 0ms. I suspect the kernels are somehow being launched concurrently, even though I intend to run them sequentially.

To illustrate, if I run two kernels with execution times of 10ms and 15ms respectively, the second one incorrectly reports an execution time of 5ms when executed sequentially.

Here are the details of my setup:

  • CUDA Version: 12.5
  • Array Size: 50 * 1024 * 1024
  • Threads Per Block: 1024
  • Blocks Per Grid: Calculated based on array size and threads per block

Below is the full code:

#include <iostream>
#include <iomanip>
#include <chrono>
#include <thread>
#include <assert.h>
#include <vector>
#include <stdio.h>
#include <string.h>
#include <time.h>

#include <cuda.h>
#include <cuda_runtime.h>

static inline void checkDrvError(CUresult res, const char *tok, const char *file, unsigned line) {
    if (res != CUDA_SUCCESS) {
        const char *errStr = NULL;
        (void)cuGetErrorString(res, &errStr);
        std::cerr << file << ':' << line << ' ' << tok << " failed (" << (unsigned)res << "): " << errStr << std::endl;
    }
}

#define CHECK_DRV(x) checkDrvError(x, #x, __FILE__, __LINE__);
typedef std::chrono::nanoseconds ReportingDuration;

#define CHECK_CUDA(ans) {check_cuda((ans), __FILE__, __LINE__);}
inline void check_cuda(int code, const char *file, int line, bool abort=true) {
    if (code != 0) {
        fprintf(stderr, "[customHook][%s:%3d]:CUDAERROR: %d\n", file, line, code);
        if (abort) exit(code);
    }
}

__global__ void add(int *a, int *b, int *c, int nElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < nElements) {
        c[i] = a[i] + b[i];
    }
}

double what_time_is_it_now() {
    struct timespec time;
    if (clock_gettime(CLOCK_MONOTONIC, &time) == -1) exit(-1);
    return (double)time.tv_sec + (double)time.tv_nsec * 0.000000001;
}

int main() {
    size_t array_size = 50*1024*1024;
    int nElements = array_size / sizeof(int);
    int *A, *B, *C;
    int threadsPerBlock = 1024;
    int blocksPerGrid = (nElements + threadsPerBlock - 1) / threadsPerBlock;
    double start, end;

    assert(array_size % (2*1024*1024) == 0);

    CHECK_CUDA(cudaMalloc(&A, array_size));
    CHECK_CUDA(cudaMalloc(&C, array_size));
    CHECK_CUDA(cudaMalloc(&B, array_size));

    // Fill up the A & B
    int *host_A, *host_B;
    host_A = (int *)malloc(array_size);
    host_B = (int *)malloc(array_size);
    for (int i = 0; i < nElements; i++) {
        host_A[i] = 2;
        host_B[i] = 4;
    }

    CHECK_CUDA(cudaMemcpy(A, host_A, array_size, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(B, host_B, array_size, cudaMemcpyHostToDevice));

    start = what_time_is_it_now();
    add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, nElements);
    end = what_time_is_it_now();
    printf("VectorAdd: %.4fms\n", (end - start) * 1000);

    start = what_time_is_it_now();
    add<<<blocksPerGrid, threadsPerBlock>>>(B, C, A, nElements);
    end = what_time_is_it_now();
    printf("VectorAdd: %.4fms\n", (end - start) * 1000);

    free(host_A);
    free(host_B);
    CHECK_CUDA(cudaFree(A));
    CHECK_CUDA(cudaFree(B));
    CHECK_CUDA(cudaFree(C));

    return 0;
}

Questions:

  1. Why does the second kernel execution report a much shorter execution time?
  2. Is there a possible overlap or concurrency issue with the kernel launches?
  3. How can I ensure that the kernels execute sequentially and measure their execution times accurately?

Any insights or suggestions would be greatly appreciated.

Thanks in advance!

Kernel launches are asynchronous with respect to the CPU: the CPU does not automatically block until the kernel has finished. That is why your current timing code only measures the time it takes to submit the kernel launch, not the execution time of the kernel itself.

You need to explicitly wait for kernel completion, for example by calling cudaDeviceSynchronize():

    start = what_time_is_it_now();
    add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, nElements);
    CHECK_CUDA(cudaDeviceSynchronize());
    end = what_time_is_it_now();
    printf("VectorAdd: %.4fms\n", (end - start) * 1000);

    start = what_time_is_it_now();
    add<<<blocksPerGrid, threadsPerBlock>>>(B, C, A, nElements);
    CHECK_CUDA(cudaDeviceSynchronize());
    end = what_time_is_it_now();
    printf("VectorAdd: %.4fms\n", (end - start) * 1000);

Hi @striker159,

Thank you for your advice. I tried incorporating cudaDeviceSynchronize() into my code, but I am still experiencing the same issue. Besides, if missing synchronization were the problem, I would expect the first execution time to also be almost 0ms. Instead, the first execution time is measured correctly, and only the second execution reports an incorrect, significantly reduced time.

Any other thoughts?

(1) Make sure you compile for the architecture of the GPU you intend to run on; otherwise the first kernel invocation will incur JIT-compilation overhead.

(2) Make sure the GPU is idle before recording the start time.

(3) Run the kernel more than two times. There are various “cold-start” effects that can affect the first few invocations of a kernel. You want to reach a “steady state” where kernel runtimes stabilize.

(4) Especially for kernels whose performance is bound by memory throughput, it is common to encounter a significant amount of “noise” in execution-time measurements. It is impossible to recreate the exact same hardware state for each kernel invocation, and small differences in the initial state of the memory hierarchy can cause noticeable runtime differences (butterfly effect). A useful heuristic is to run such a kernel ten times, then report the fastest time measured.
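Regarding point (1), the architecture is selected at compile time. A minimal sketch, assuming the source file is named vector_add.cu and the GPU is compute capability 8.6 (both placeholders; substitute your own):

```shell
# Compile for the GPU's actual architecture so the first launch does not
# pay JIT-compilation cost. sm_86 and vector_add.cu are placeholders;
# check your GPU's compute capability with nvidia-smi or deviceQuery.
nvcc -arch=sm_86 -O3 vector_add.cu -o vector_add
```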

Here is how I changed your code:

    for (int i = 0; i < 10; i++) {
        cudaDeviceSynchronize(); // wait until previous work complete
        start = what_time_is_it_now();
        add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, nElements);
        cudaDeviceSynchronize(); // wait until kernel complete
        end = what_time_is_it_now();
        printf("VectorAdd: %.4fms\n", (end - start) * 1000);
    }

Representative sample output on my system looks like so:

 VectorAdd: 0.6592ms 
 VectorAdd: 0.4921ms
 VectorAdd: 0.8597ms
 VectorAdd: 0.7317ms
 VectorAdd: 0.7215ms  <<< steady state reached
 VectorAdd: 0.7245ms
 VectorAdd: 0.7214ms
 VectorAdd: 0.7287ms 
 VectorAdd: 0.7231ms
 VectorAdd: 0.7201ms

It appears that steady state is reached after the first four iterations.
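As an alternative to host-side wall-clock timing, you can also time the kernel on the device itself with CUDA events, which bracket the kernel in stream order and so exclude launch-submission overhead. A minimal sketch, assuming the same add kernel, device buffers, launch configuration, and CHECK_CUDA macro as in the code above (requires a CUDA-capable GPU to run):

```cuda
// Device-side timing of the add kernel using CUDA events.
cudaEvent_t evStart, evStop;
CHECK_CUDA(cudaEventCreate(&evStart));
CHECK_CUDA(cudaEventCreate(&evStop));

for (int i = 0; i < 10; i++) {
    CHECK_CUDA(cudaEventRecord(evStart));      // timestamp before the kernel
    add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, nElements);
    CHECK_CUDA(cudaEventRecord(evStop));       // timestamp after the kernel
    CHECK_CUDA(cudaEventSynchronize(evStop));  // wait for the stop event
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, evStart, evStop));
    printf("VectorAdd (events): %.4fms\n", ms);
}

CHECK_CUDA(cudaEventDestroy(evStart));
CHECK_CUDA(cudaEventDestroy(evStop));
```

Since both events are recorded into the same stream as the kernel, cudaEventElapsedTime reports the GPU-side duration between them regardless of when the CPU gets around to synchronizing.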

Nowadays, it’s expected that the first launch of a particular kernel will take significantly longer than subsequent launches of that kernel. Factors contributing to this existed prior to CUDA 12.2, but since CUDA 12.2, lazy module loading (now the default) is another factor that contributes to this significantly.
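If you want to take lazy loading out of the picture while benchmarking, the module-loading mode can be forced to eager via an environment variable (available since CUDA 11.7; ./vector_add is a placeholder binary name):

```shell
# Force eager module loading so kernels are loaded at context
# initialization rather than on first launch.
# ./vector_add is a placeholder for your compiled binary.
CUDA_MODULE_LOADING=EAGER ./vector_add
```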

You could also measure kernel execution time with a profiler (Nsight Compute) to get a better picture of what is going on, and compare those results with yours.
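For example, Nsight Compute’s command-line interface can profile just the add kernel; a sketch, with ./vector_add as a placeholder binary name:

```shell
# Profile only kernels named "add" with Nsight Compute's CLI.
# ./vector_add is a placeholder for your compiled binary.
ncu --kernel-name add ./vector_add
```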