Time travel with PGIs system_clock implementation

Hi all,

For a long time there has been a problem with timing our application (the COSMO numerical weather prediction and climate model) using the Portland Group compiler, since negative/strange values were sometimes present in the timing output. Now that I’ve finally found some time to investigate this issue, I’ve stumbled upon two of the sources of the problem. PGI’s system_clock command does not seem to be standard conformant. Two facts violate the standard (see example below):

  1. The clock count is not always monotonically increasing

  2. When a overflow in the counter occurs, the clock count does not restart with 0 but with -count_max

Any suggestions?
Oli

=================================

olifu@rosa1:~[226]> cat test_clock.f90
program test_clock
implicit none
integer :: icount,ir,im,iold
real*8 :: rtime
call system_clock (count=icount,count_rate=ir,count_max=im)
iold=icount
write(*,'(5A15)') 'current','old','rate (1/s)','max','deltaT (s)'
do
  ! get current count
  call system_clock (count=icount)
  ! normal way to compute elapsed time in COSMO
  if (icount >= iold) then
    rtime = dble(icount-iold)/dble(ir)
  else
    rtime = dble(im-(iold-icount))/dble(ir)
  end if
  ! only output when something strange or counter overflow
  if (icount < iold) then
    write(*,'(4I15,F15.7)') icount,iold,ir,im,rtime
  end if
  ! save for next loop iteration
  iold = icount
end do
end program test_clock

olifu@rosa1:~[227]> pgf90 test_clock.f90 -o test_clock

olifu@rosa1:~[228]> ./test_clock
        current            old     rate (1/s)            max     deltaT (s)
        2122468        2122469        1000000     2147483647   2147.4836460
        2990456        2990457        1000000     2147483647   2147.4836460
       16206258       16206259        1000000     2147483647   2147.4836460
       16806248       16806250        1000000     2147483647   2147.4836450
       17538238       17538239        1000000     2147483647   2147.4836460
       17738234       17738235        1000000     2147483647   2147.4836460
...
     2011928352     2011928353        1000000     2147483647   2147.4836460
     2042503892     2042503894        1000000     2147483647   2147.4836450
     2095707096     2095707097        1000000     2147483647   2147.4836460
     2142302397     2142302398        1000000     2147483647   2147.4836460
    -2147483648     2147483647        1000000     2147483647  -2147.4836480
    -2129233250    -2129233249        1000000     2147483647   2147.4836460
    -2128229264    -2128229263        1000000     2147483647   2147.4836460

Hi Oli,

  1. The clock count is not always monotonically increasing

My best guess is that your process is being moved by the OS from one core to another. This would lead to this type of behavior. Try binding to a single core to see if it still occurs.

If it does, please let me know what system you’re using and compiler version. So far I’ve tired 8.0, 9.0, and 10.2 on a variety of systems, but haven’t been able to recreate it.

  1. When a overflow in the counter occurs, the clock count does not restart with 0 but with -count_max

This is a bug. I’ve created a report (TPR#16667) and sent it to our engineers.

Note that in next weeks 10.3 release, we’ll have support for an INTEGER*8 version of SYSTEM_CLOCK (when compiled with -i8). This will increase the MAXCOUNT to 9223372036854775807.

Thanks,
Mat

Hi Matt,

Thanks for the quick reply.

  1. The clock count is not always monotonically increasing
    My best guess is that your process is being moved by the OS from one core to another. This would lead to this type of behavior. Try binding to a single core to see if it still occurs.

Your absolutely correct. Setting processor affinity (see below) removed the jitter for problem 1.

olifu@rosa3:~[159]> taskset -p 0x00000001 17913
pid 17913's current affinity mask: 3
pid 17913's new affinity mask: 1

I’ve been running this on a Cray XT5. You’re explanation also explains why part of the problem only occurred when running on the service nodes. The compute nodes have core affinity for each process setup by default.

  1. When a overflow in the counter occurs, the clock count does not restart with 0 but with -count_max
    This is a bug. I’ve created a report (TPR#16667) and sent it to our engineers.

Perfect. For the moment I will have to use a workaround. As we are not admins on these machines I’ll forward the information to the people in charge of the PGI compiler installation.

Thanks,
Oli

Your problem with SYSTEM_CLOCK not resetting to zero after MAXCOUNT
reached has been corrected in the current 10.4 release.

Thanks again for your submission.

regards,
dave