First run is slower than the following ones

Hi there,

I noticed this across my benchmarks: the first time a kernel is run, it takes longer, regardless of whether I call acc_init(0).

For each benchmark I have, I run it several times and take the average time of the executions.
I use clock() to measure time, and the main loop is basically:

for (int i = 0; i < num_tests; i++) {
  clock_t begin = clock();   /* clock() returns clock_t, not time_t */
  benchmark();               /* the kernel region under test */
  clock_t end = clock();
  /* accumulate (end - begin) / CLOCKS_PER_SEC for the average */
}
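
For reference, the full harness is roughly this (just a sketch: benchmark() and NUM_TESTS stand in for my actual transpose call and iteration count):

#include <stdio.h>
#include <time.h>

#define NUM_TESTS 5

/* placeholder for the OpenACC transpose region under test */
void benchmark(void) { /* kernel call goes here */ }

int main(void) {
  double total = 0.0;
  for (int i = 0; i < NUM_TESTS; i++) {
    clock_t begin = clock();
    benchmark();
    clock_t end = clock();
    double elapsed = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("run %d: %.3f s\n", i, elapsed);  /* the first run stands out here */
    total += elapsed;
  }
  printf("average over %d runs: %.3f s\n", NUM_TESTS, total / NUM_TESTS);
  return 0;
}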

I found out that the first iteration would take longer than the subsequent ones.
At first I thought it was because I had forgotten to call acc_init() at the beginning of my program, but adding it doesn’t actually make much of a difference.
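
For completeness, the call I tried looks roughly like this (a sketch, assuming the NVIDIA device type constant from openacc.h):

#include <openacc.h>

int main(void) {
  /* initialize the device up front so setup cost is not charged to
     the first kernel launch -- in my case it barely changes anything */
  acc_init(acc_device_nvidia);
  /* ... timed benchmark loop follows ... */
  return 0;
}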

I then suspected that the compiler was re-using the data on the device instead of copying it over again.
I did a small experiment: I created another pair of input/output arrays with different data and ran an “empty” benchmark on that fake data before running my real benchmark.
That way, if there were some sort of clever re-use of data, it couldn’t affect my program: and yet the real benchmark still ran faster after the warm-up.

Then I changed the “warm-up” benchmark to something entirely different. My benchmark is a matrix transpose, but I wrote a simple array copy with OpenACC and ran that as the warm-up instead (the benchmark was still the transpose, only the warm-up was the copy).
It didn’t make any difference: with either warm-up, the code ran faster than without one.
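
The warm-up copy is roughly this (a sketch; the array size and names are only for illustration):

#define N (1 << 20)

/* warm-up: a trivial OpenACC array copy on dummy data, run once
   before the timed transpose benchmark */
void warmup_copy(void) {
  static float src[N], dst[N];
  for (int i = 0; i < N; i++) src[i] = (float)i;

  #pragma acc kernels copyin(src) copyout(dst)
  for (int i = 0; i < N; i++)
    dst[i] = src[i];
}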

Apparently, the first OpenACC invocation still needs to perform some initialization, no matter which kernel it is.

Here are some numbers for an 8192x8192 matrix transpose:
No warm-up: 0.362 s average (run 5 times, max: 0.39, min: 0.32)

Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    9: region entered 1 time
        time(us): total=40,066 init= region=40,065
                  kernels=39,934
        w/o init: total=40,065 max=40,065 min=40,065 avg=40,065
        12: kernel launched 1 times
            grid: [128x2048]  block: [64x4]
            time(us): total=39,934 max=39,934 min=39,934 avg=39,934
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    6: region entered 1 time
        time(us): total=372,274 init=126,802 region=245,472
                  data=204,136
        w/o init: total=245,472 max=245,472 min=245,472 avg=245,472

With warm-up: 0.24 s average (max: 0.25, min: 0.23)

Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    9: region entered 2 times
        time(us): total=80,044 init= region=80,043
                  kernels=79,892
        w/o init: total=80,043 max=40,088 min=39,955 avg=40,021
        12: kernel launched 2 times
            grid: [128x2048]  block: [64x4]
            time(us): total=79,892 max=39,953 min=39,939 avg=39,946
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    6: region entered 2 times
        time(us): total=563,089 init=82,602 region=480,487
                  data=398,955
        w/o init: total=480,487 max=242,329 min=238,158 avg=240,243

My questions are:
Is there something I’m not aware of?
Is this standard/expected behaviour?
Is running a warm-up computation the right way to handle this?

Hi Lechat,

What you’re seeing is the cost to copy and load the kernel itself over to the GPU. When you call the same kernel repeatedly in succession, it doesn’t need to copy it over again.

Hope this explains what’s going on.

  • Mat