First run is slower than the following

odranoel · October 21, 2012, 8:29pm

Hi there,

I've noticed this across my benchmarks: the first time a kernel runs, it takes longer, regardless of whether I call acc_init(0).

For each benchmark I have, I run it several times and take the average time of the executions.
I use clock() to measure time, and the main loop is basically:

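double total = 0.0;
for (int i = 0; i < num_tests; i++) {
  clock_t begin = clock();   /* clock() returns clock_t, not time_t */
  benchmark();               /* the code being measured */
  clock_t end = clock();
  total += (double)(end - begin) / CLOCKS_PER_SEC;   /* seconds */
}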

I found out that the first iteration would take longer than the subsequent ones.
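
A minimal sketch of how the per-iteration times could be recorded so that the first run is reported apart from the later ones. The names `run_stats`, `summarize` and `time_runs` are hypothetical and not part of the original benchmark; like the loop above, it uses clock(), which reports processor time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical summary: the first run kept apart from the later ones. */
typedef struct {
    double first;       /* seconds for iteration 0 */
    double steady_avg;  /* mean seconds over iterations 1..n-1, 0 if n == 1 */
} run_stats;

/* Summarize n per-iteration times (in seconds). */
run_stats summarize(const double *secs, int n)
{
    assert(secs != NULL && n >= 1);
    run_stats s = { secs[0], 0.0 };
    for (int i = 1; i < n; i++)
        s.steady_avg += secs[i];
    if (n > 1)
        s.steady_avg /= (n - 1);
    return s;
}

/* Time fn() n times with clock(), storing each elapsed time in secs[i]. */
void time_runs(void (*fn)(void), double *secs, int n)
{
    assert(fn != NULL && secs != NULL && n >= 1);
    for (int i = 0; i < n; i++) {
        clock_t begin = clock();
        fn();
        clock_t end = clock();
        secs[i] = (double)(end - begin) / CLOCKS_PER_SEC;
    }
}
```

Comparing `first` with `steady_avg` shows the one-time cost directly, instead of having it folded into a single average.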
At first I thought it was because I had forgotten to call acc_init() at the start of my program, but it doesn’t actually make much of a difference.

I then suspected that the compiler was reusing the data instead of copying it over again.
I did a small experiment: I created another set of input/output arrays with different data, and ran an “empty” benchmark on this fake data before running my real benchmark.
That way, any clever reuse of data couldn’t benefit my real benchmark. Well, it still ran faster.

Next I changed the “warm-up” benchmark to something different. My benchmark is a matrix transpose, so I wrote a simple array copy with OpenACC and ran that as my warm-up (the benchmark was still the transpose, but the warm-up was the copy).
It didn’t make any difference: regardless of which warm-up I used, the code ran faster than it did without a warm-up.

Apparently, the first OpenACC invocation STILL needs to perform some initialization, even with acc_init().
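
A minimal sketch of the kind of array-copy warm-up described above, under some assumptions: the function name `warm_up`, the size `WARM_N` and the use of `acc_device_default` are illustrative and not taken from the original benchmark. The OpenACC parts are guarded by the standard `_OPENACC` macro so the sketch also builds, and simply runs on the host, without an OpenACC compiler.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

#define WARM_N 1024  /* hypothetical size; any small array should do */

/* Hypothetical warm-up: offload one trivial array copy so that one-time
   start-up costs are paid before any timed region.
   Returns the number of mismatched elements (0 on success). */
int warm_up(void)
{
    static float src[WARM_N], dst[WARM_N];
    for (int i = 0; i < WARM_N; i++) {
        src[i] = (float)i;
        dst[i] = -1.0f;
    }
#ifdef _OPENACC
    acc_init(acc_device_default);  /* explicit runtime initialization */
    #pragma acc parallel loop copyin(src[0:WARM_N]) copyout(dst[0:WARM_N])
#endif
    for (int i = 0; i < WARM_N; i++)
        dst[i] = src[i];

    /* Check on the host that the copy actually happened. */
    int mismatches = 0;
    for (int i = 0; i < WARM_N; i++)
        if (dst[i] != src[i])
            mismatches++;
    return mismatches;
}
```

Calling warm_up() once before the timed loop would keep that first-invocation cost out of the measurements. As described above, the warm-up kernel does not have to be the same kernel as the benchmark.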

Here are some numbers for the matrix transpose 8192x8192:
No warm-up: 0.362 seconds (run 5 times, max: 0.39 min: 0.32)

Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    9: region entered 1 time
        time(us): total=40,066 init= region=40,065
                  kernels=39,934
        w/o init: total=40,065 max=40,065 min=40,065 avg=40,065
        12: kernel launched 1 times
            grid: [128x2048]  block: [64x4]
            time(us): total=39,934 max=39,934 min=39,934 avg=39,934
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    6: region entered 1 time
        time(us): total=372,274 init=126,802 region=245,472
                  data=204,136
        w/o init: total=245,472 max=245,472 min=245,472 avg=245,472

With warm-up: 0.24s (max 0.25, min 0.23)

Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    9: region entered 2 times
        time(us): total=80,044 init= region=80,043
                  kernels=79,892
        w/o init: total=80,043 max=40,088 min=39,955 avg=40,021
        12: kernel launched 2 times
            grid: [128x2048]  block: [64x4]
            time(us): total=79,892 max=39,953 min=39,939 avg=39,946
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    6: region entered 2 times
        time(us): total=563,089 init=82,602 region=480,487
                  data=398,955
        w/o init: total=480,487 max=242,329 min=238,158 avg=240,243

My questions are:
Is there something I’m not aware of?
Is this a standard/expected behaviour?
Is running a warm-up computation the right way to handle this?

MatColgrove · October 22, 2012, 7:08pm

Hi Lechat,

What you’re seeing is the cost to copy and load the kernel itself over to the GPU. When you call the same kernel repeatedly in succession, it doesn’t need to copy it over again.

Hope this explains what’s going on.

Mat

624801474 · December 11, 2024, 3:21am

The same kernel function, but with different data?
