I use pgf90 6.0-2 32-bit target on x86 Linux and I have a program with calls of the type
call foo(event%time)
Here, the subroutine foo is defined in a module, and event is an allocatable array of a derived type:
type :: event_properties
   real :: time
   !...
end type event_properties
type (event_properties), allocatable, dimension(:) :: event
I have compiled the code both using pgf90 and ifort, using both optimization and debug flags. The pgf90-generated executable runs about four times slower than the one made by ifort!
Has anyone experienced this, and do you know what I can do to speed up pgf90? I have looked in the manual pages, but did not find any flags that helped.
I’m unaware of any performance problems related to pointer transfer, so I don’t have any good suggestions for you. Would it be possible to get a copy of this code so we can investigate what’s going wrong? If you could post a link here or send the information to trs@pgroup.com I would appreciate it. Also, please post more detail about what optimizations you have tried and what OS you’re using.
It’s just a 50-line test program. It doesn’t do anything sensible, but it’s stripped down from a “real-life” example. There’s a type declaration with reals, integers, and characters, which are not all used in the test program. I have noticed that the program runs faster if I strip them away, but I need them in the real-life code…
I compiled the code using pgf90 -fast -r8, and ifort -O3 -r8. On my computer, the pgf90 code took 5.7 seconds, while the ifort code took 1.4 seconds. The CPU times are similar without optimization flags as well.
I’ve passed your code on to our compiler team since it looks like a bug to me. What’s happening is that before the call to “foo” the compiler needs to create a temporary array containing the values of “event%t_last_collision”. This is correct and expected. After the call, we copy the contents of the temporary back into event. However, since the “time” argument in foo is declared “intent(in)”, foo cannot modify it, so that copy-back step is unnecessary.
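To make the copy-in/copy-out step concrete, here is a minimal sketch of roughly what the compiler has to do around such a call. The body of foo, the explicit-shape dummy time(n), the extra component id, and the names foo_mod, copy_in_out and tmp are assumptions for illustration only; the original test program was not posted.

```fortran
module foo_mod
  implicit none
contains
  ! Hypothetical stand-in for foo; the real routine was not posted.
  subroutine foo(time, n, count)
    integer, intent(in)  :: n
    real,    intent(in)  :: time(n)   ! explicit-shape dummy: expects contiguous storage
    integer, intent(out) :: count
    count = size(time)                ! placeholder work
  end subroutine foo
end module foo_mod

program copy_in_out
  use foo_mod
  implicit none
  type :: event_properties
    real    :: t_last_collision
    integer :: id                     ! assumed extra component; makes the stride non-unit
  end type event_properties
  integer, parameter :: maxpart = 4
  type(event_properties) :: event(maxpart)
  real    :: tmp(maxpart)
  integer :: i, count

  do i = 1, maxpart
    event(i)%t_last_collision = real(i)
    event(i)%id = i
  end do

  ! Roughly equivalent to: call foo(event%t_last_collision, maxpart, count)
  tmp = event%t_last_collision        ! copy-in: gather the strided values
  call foo(tmp, maxpart, count)
  event%t_last_collision = tmp        ! copy-out: redundant when time is intent(in)

  print *, 'count =', count
end program copy_in_out
```

The gather is needed because the values of t_last_collision are not adjacent in memory once the type has other components. The scatter after the call is the part that intent(in) makes avoidable.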
Note that you could dramatically speed up your code (with both PGI and Intel) by using a temporary array to store the values of t_last_collision, since either compiler must otherwise create its own temporary array before entry into foo on every iteration of the loop.
Example:
allocate(tmp2(maxpart))
tmp2 = event%t_last_collision
do i=1,10000
   ! call foo(event%t_last_collision,maxpart,count)
   call foo(tmp2,maxpart,count)
end do
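If you want to measure the difference yourself, a self-contained sketch along the following lines might help. It times the direct call against the version with the gather hoisted out of the loop, using the standard cpu_time intrinsic. The body of foo, the components id and label, the value of maxpart, and the names foo_mod, time_gather and nrep are assumptions, not taken from the original test program.

```fortran
module foo_mod
  implicit none
contains
  ! Hypothetical stand-in for foo: counts the positive entries of time.
  subroutine foo(time, n, count)
    integer, intent(in)  :: n
    real,    intent(in)  :: time(n)
    integer, intent(out) :: count
    integer :: j
    count = 0
    do j = 1, n
      if (time(j) > 0.0) count = count + 1
    end do
  end subroutine foo
end module foo_mod

program time_gather
  use foo_mod
  implicit none
  type :: event_properties
    real             :: t_last_collision
    integer          :: id            ! assumed extra components, as in the
    character(len=8) :: label         ! description of the original type
  end type event_properties
  integer, parameter :: maxpart = 1000, nrep = 10000
  type(event_properties), allocatable :: event(:)
  real, allocatable :: tmp2(:)
  real    :: t0, t1, t2
  integer :: i, count

  allocate(event(maxpart), tmp2(maxpart))
  do i = 1, maxpart
    event(i)%t_last_collision = real(i)
    event(i)%id = i
    event(i)%label = 'particle'
  end do

  ! Direct call: the compiler gathers a temporary on every iteration.
  call cpu_time(t0)
  do i = 1, nrep
    call foo(event%t_last_collision, maxpart, count)
  end do
  call cpu_time(t1)

  ! Hoisted gather: one explicit copy, then contiguous data on every call.
  tmp2 = event%t_last_collision
  do i = 1, nrep
    call foo(tmp2, maxpart, count)
  end do
  call cpu_time(t2)

  print *, 'direct call (s): ', t1 - t0
  print *, 'hoisted copy (s):', t2 - t1
  print *, 'count =', count
end program time_gather
```

The hoisted version is only valid while event%t_last_collision does not change inside the loop. If it does, tmp2 has to be refreshed after each change. Also note that an optimizing compiler may simplify a loop this trivial, so the timings are only indicative.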
Once I know more, I’ll post it here. We appreciate you calling this to our attention.
Thank you for your answer (and sorry for my late one). What we found out was similar to your advice, and I ended up using a normal array instead of “event%t_last_collision”, gaining a lot of speed (similar to switching from -g to -fast).
I’d be interested in knowing any further developments, though.