Here’s a cut-down example illustrating something I came across today that has me perplexed.
I was checking the timings of two kernels that should do the same amount of work, and therefore have very similar timings, when I noticed a discrepancy. Further investigation led me to the data copies and the following.
acc_init(acc_device_nvidia);   /* pay the device start-up cost before timing */

/* initialise both arrays so the pages are touched before the copies are timed */
for (i = 0; i < n; i++) {
    A[i] = rand();
    B[i] = rand();
}

/* time an empty data region that copies B to the device and back */
t_start = omp_get_wtime();
#pragma acc data copy(B[0:n])
{
}
t_end = omp_get_wtime();
t1 = t_end - t_start;

/* same again for A */
t_start = omp_get_wtime();
#pragma acc data copy(A[0:n])
{
}
t_end = omp_get_wtime();
t2 = t_end - t_start;
A and B are 1D arrays of n ints, and n is large (e.g. 536870912). I’m using the OpenMP timer routines because I need to report some timings from within the code, and I’ve found they produce results very similar to those from the run-time timers (PGI_ACC_TIME’s output).
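For completeness, here’s a minimal self-contained version of the test. The declarations, allocation and final printf are just the obvious scaffolding around the cut-down snippet above; the real code does more around this.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <openacc.h>

int main(void)
{
    long n = 536870912;                 /* 2^29 ints, i.e. 2 GiB per array */
    long i;
    double t_start, t_end, t1, t2;
    int *A = malloc(n * sizeof(int));
    int *B = malloc(n * sizeof(int));

    acc_init(acc_device_nvidia);        /* device start-up paid up front */

    for (i = 0; i < n; i++) {
        A[i] = rand();
        B[i] = rand();
    }

    t_start = omp_get_wtime();
    #pragma acc data copy(B[0:n])       /* copy in at entry, copy out at exit */
    {
    }
    t_end = omp_get_wtime();
    t1 = t_end - t_start;

    t_start = omp_get_wtime();
    #pragma acc data copy(A[0:n])
    {
    }
    t_end = omp_get_wtime();
    t2 = t_end - t_start;

    printf("B region: %.6f s, A region: %.6f s\n", t1, t2);

    free(A);
    free(B);
    return 0;
}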
The card is a Tesla C2050 on a back-end system with a job launcher, running in exclusive mode, so I’m fairly sure I’m not getting interference from other jobs. I’m using PGI v12.6.
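For reference, the build line is something like the following (from memory, so treat the exact flags as approximate; the time sub-option to -ta is what links in the timing library that prints the profile below, and -mp is there for omp_get_wtime):

pgcc -acc -ta=nvidia,time -mp -o iftime iftime.c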
The output I get is:
main
  52: region entered 1 time
      time(us): total=280,143
                data=278,501
/home/gd11/gd11/njohnso1/iftime.c
main
  42: region entered 1 time
      time(us): total=320,320
                data=279,764
acc_init.c
acc_init
  38: region entered 1 time
      time(us): init=5,136,287
The question is: why do I get an additional ~40ms spent in the region at line 42 compared with the region at line 52? Subtracting the data time from the total gives 320,320 - 279,764 = 40,556us of overhead at line 42, but only 280,143 - 278,501 = 1,642us at line 52. I can accept the small difference in the data copy times themselves as noise in the system, but I’d expect the region time to sit pretty close to the data copy time, give or take a fixed region overhead, rather than to carry noise of its own. I should note that on occasion the values do fall roughly in line, but that is the exception rather than the rule.
Cheers,
-Nick.