Hi there,
I noticed this across my benchmarks: the first time a kernel is run, it takes longer, regardless of the usage of acc_init(0).
For each benchmark I have, I run it several times and take the average time of the executions.
I use clock() to measure time, and the main loop is basically:
for (int i=0; i<num_tests; i++) {
time_t begin = clock();
benchmark;
time_t end = clock();
}
I found out that the first iteration would take longer than the subsequent ones.
On the beginning I thought it was because I had forgotten the acc_init() in the beginning of my program, but it actually doesn’t make much of a difference.
I then supposed that the compiler was re-using the data, instead of copying it over again.
I did a small experiment: I created another input/output arrays, with different data, and performed an “empty” benchmark with the fake data before running my real benchmark.
This way, if there was some sort of clever re-using of data, it wouldn’t affect my program: well, it still worked faster.
Then what I did was to change the “warm-up” benchmark to a different thing. I was doing a matrix transpose, but then I wrote a simple array copy with openacc and run that as my warm-up (the benchmark was still the transpose, but the warm-up was the copy).
It didn’t make any difference: regardless of which warm-up, the code ran faster than the warm-up-less code.
Apparently, the first openacc invocation that happens STILL needs to perform some initialization.
Here are some numbers for the matrix transpose 8192x8192:
No warm-up: 0.362 seconds (run 5 times, max: 0.39 min: 0.32)
Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
trans
9: region entered 1 time
time(us): total=40,066 init= region=40,065
kernels=39,934
w/o init: total=40,065 max=40,065 min=40,065 avg=40,065
12: kernel launched 1 times
grid: [128x2048] block: [64x4]
time(us): total=39,934 max=39,934 min=39,934 avg=39,934
/home/lechat/tcc/transpose/src/trans_acc.c
trans
6: region entered 1 time
time(us): total=372,274 init=126,802 region=245,472
data=204,136
w/o init: total=245,472 max=245,472 min=245,472 avg=245,472
With warm-up: 0.24s (max 0.25, min 0.23)
Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
trans
9: region entered 2 times
time(us): total=80,044 init= region=80,043
kernels=79,892
w/o init: total=80,043 max=40,088 min=39,955 avg=40,021
12: kernel launched 2 times
grid: [128x2048] block: [64x4]
time(us): total=79,892 max=39,953 min=39,939 avg=39,946
/home/lechat/tcc/transpose/src/trans_acc.c
trans
6: region entered 2 times
time(us): total=563,089 init=82,602 region=480,487
data=398,955
w/o init: total=480,487 max=242,329 min=238,158 avg=240,243
My questions are:
Is there something I’m not aware?
Is this a standard/expected behaviour?
Is running a warm-up computation the right way to handle this?