OpenACC runtime timings

Here's a cut-down example illustrating something I came across today that has me perplexed.

I was checking the timings of two kernels that should have done the same amount of work, and therefore should have had very similar timings, when I noticed a discrepancy. Further investigation led me to the data copies and the following.

  acc_init(acc_device_nvidia);

  for (i=0;i<n;i++){
    A[i] = rand();
    B[i] = rand();
  }

  t_start = omp_get_wtime();
  /* empty data region for B: this is the region reported at line 42 in the profile below */
#pragma acc data copy(B[0:n])
  {
  }
  t_end = omp_get_wtime();
  t1 = t_end - t_start;

  t_start = omp_get_wtime();
  /* empty data region for A: reported at line 52 in the profile below */
#pragma acc data copy(A[0:n])
  {
  }
  t_end = omp_get_wtime();
  t2 = t_end - t_start;

A and B are 1D arrays of n ints, and n is large (e.g. 536870912). I'm using the OpenMP timer routines because I need to report some timings from the code, and I've found this method to produce results very similar to those I get from the runtime's own timers (PGI_ACC_TIME's output).
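
For reference, a minimal complete program built around the fragment might look roughly like the following (the includes, declarations, and allocations are filled in here and weren't part of my original code; the int element type follows from the description above):

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>
  #include <openacc.h>

  int main(void)
  {
    size_t n = 536870912;   /* large, as described; shrink if memory is tight */
    size_t i;
    double t_start, t_end, t1, t2;
    int *A = (int *) malloc(n * sizeof(int));
    int *B = (int *) malloc(n * sizeof(int));

    acc_init(acc_device_nvidia);

    for (i = 0; i < n; i++) {
      A[i] = rand();
      B[i] = rand();
    }

    t_start = omp_get_wtime();
#pragma acc data copy(B[0:n])
    {
    }
    t_end = omp_get_wtime();
    t1 = t_end - t_start;

    t_start = omp_get_wtime();
#pragma acc data copy(A[0:n])
    {
    }
    t_end = omp_get_wtime();
    t2 = t_end - t_start;

    printf("B region: %f s, A region: %f s\n", t1, t2);

    free(A);
    free(B);
    return 0;
  }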

The card is a Tesla C2050 in a backend system with a job launcher, running in exclusive mode, so I'm fairly sure I'm not getting interference from other jobs. I'm using PGI v12.6.

The output I get is:

  main
    52: region entered 1 time
        time(us): total=280,143
                  data=278,501
/home/gd11/gd11/njohnso1/iftime.c
  main
    42: region entered 1 time
        time(us): total=320,320
                  data=279,764
acc_init.c
  acc_init
    38: region entered 1 time
        time(us): init=5,136,287

The question is: why do I get an additional ~40 ms spent in the region at line 42 compared with the region at line 52? I can understand the difference in the data copy time as noise in the system, but I'd expect to find no noise in the region time; it should be pretty close to the data copy time, give or take a fixed per-region overhead. I should note that on occasion the values do fall roughly in line, but that is the exception rather than the rule.

Cheers,
-Nick.

Hi Nick,

Total time is measured from the host side, while data transfer time is measured on the device, so the extra time is occurring on the host. I can only speculate as to the exact reason for the increase. Some possible reasons:

Device-to-host transfers require the use of pinned memory, so a host-to-host copy needs to be performed first (virtual memory to pinned physical memory). The OS may need to reclaim this memory before staging the second array. (A sketch isolating this cost is given below, after this list of possibilities.)

It could be a caching issue where B is cached while A is coming from main memory.

The OS could be interrupting the code while the copy is occurring and not getting back to it for a while.
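
To test the pinned-memory idea in isolation, you could time the same host-to-device transfer from a pageable buffer versus a page-locked one using the CUDA runtime directly. This is only a sketch (the 256 MB size is arbitrary, and I'm reusing omp_get_wtime for consistency with your code):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <cuda_runtime.h>
  #include <omp.h>

  int main(void)
  {
    size_t bytes = 256 * 1024 * 1024;  /* arbitrary test size */
    void *dev, *pageable, *pinned;

    cudaMalloc(&dev, bytes);           /* also creates the CUDA context */
    pageable = malloc(bytes);          /* ordinary pageable virtual memory */
    cudaMallocHost(&pinned, bytes);    /* page-locked (pinned) memory */
    memset(pageable, 1, bytes);
    memset(pinned, 1, bytes);

    /* pageable transfer: the driver stages through an internal pinned buffer */
    double t0 = omp_get_wtime();
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    double t1 = omp_get_wtime();

    /* pinned transfer: DMA directly from the user buffer */
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    double t2 = omp_get_wtime();

    printf("pageable: %.3f s  pinned: %.3f s\n", t1 - t0, t2 - t1);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
  }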

One thing to try is rearranging the code a bit:

  acc_init(acc_device_nvidia);

  for (i=0;i<n;i++){
    B[i] = rand();
  }

  t_start = omp_get_wtime();
#pragma acc data copy(B[0:n])
  {
  }
  t_end = omp_get_wtime();
  t1 = t_end - t_start;

  for (i=0;i<n;i++){
    A[i] = rand();
  }

  t_start = omp_get_wtime();
#pragma acc data copy(A[0:n])
  {
  }
  t_end = omp_get_wtime();
  t2 = t_end - t_start;

If it is a memory caching issue, I would expect this rearrangement to even out the host-side times.
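
You could also repeat each timed region several times and keep the minimum, which tends to filter out one-off effects such as OS interrupts or the first-touch cost of the staging buffers. A sketch, reusing A, n, and the timer variables from your code:

  double best = 1.0e30;
  int k;
  for (k = 0; k < 5; k++) {
    t_start = omp_get_wtime();
#pragma acc data copy(A[0:n])
    {
    }
    t_end = omp_get_wtime();
    if (t_end - t_start < best)
      best = t_end - t_start;   /* keep the least-disturbed run */
  }
  t2 = best;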

-Mat