-Mf90 switch and compiling an SMP program with pghpf

I have a finite difference solver written in Fortran 90 with HPF directives inserted. I am running a dual-processor Xeon with 2GB of memory.

If I use the Intel compiler v8 (ifort) with vectorisation turned on, my code executes 10 timesteps in 30s running in serial; compiling the same code with the pghpf compiler results in an execution time of 2m30s.
If I add the -Mf90 switch, this comes down to the Intel time of 30s.

Since I have a dual-processor machine, I would like to run an SMP setup.

Compiling using -Msmp with the pghpf compiler reduces the run time to 1m18s, running at around 90% on both CPUs. However, if I add in the -Mf90 switch, the two processors both run the same code with no communication.

The Intel compiler refuses to recognise the parallel sections of code (it complains about not being able to read the trip count), so I cannot use it to compare times for an SMP setup.

My question is: why does the -Mf90 switch seem to disable or override the -Msmp switch? I am still getting messages about loops being parallelised; it's just that when I execute the code, it simply runs two separate processes.

The compilation switches and commands I have used are as follows
(2m30s serial, 1m30s SMP):
pghpf -Mautopar -fastsse -O2 -tp p7 -Mcache_align -Mvect=sse -Mconcur=assoc -Minline=levels:10 -Minfo=all -Minform=inform -Mnofree -Msmp fd3d_map.f90 fd3drout1.f90 diffrout1.f90 -o pghpf_fd3d.x
Adding -Mf90 reduces the serial time to 30s, but does not work for SMP.

serial command:
./pghpf_fd3d.x -pghpf -stat alls

SMP command:
./pghpf_fd3d.x -pghpf -np 2 -heapz 150m -stat alls

I’m not sure I can answer this completely without seeing your code, but let me try:

If you compile with pghpf and -Mf90, you are basically disregarding the HPF directives and in effect using our pgf90 compiler. So this is the serial case you mention that runs in 30 seconds. Since the code compiled with HPF runs in 5x that time, it seems there is just a lot of overhead for the parallelization that is being performed. But you are using the default communication library here.

When you compile with pghpf and -Msmp, you do get improved performance over the default communication library, but it is still 2.5x slower than the serial code. It could be that the decomposition is too fine-grained for you to get better performance than the serial case. I can’t answer that without seeing the code.

It may be that some of the options you are using are clobbering each other. I see pghpf with -Mautopar, -Mconcur, and -Msmp. If all of the loops you are hoping to parallelize are in foralls, you probably don’t want -Mautopar, and I don’t think in any case you want -Mconcur, because that might conflict over the total number of threads at runtime. Again, it is hard to know. Have you profiled the code?

In general, for dual-processor SMP machines, you would be better suited running an OpenMP version of the code, or trying -Mconcur without HPF.

  • Brent

Hi Brent,

Thanks for the reply. I have tried compiling using pghpf and just the -Mautopar switch (there are a few independent do loops but no foralls in the code). The code runs, but the performance is poor and gets worse as more processors are added, so it seems the overheads or excessive communication are slowing things down. One other possibility: I am using ssh as the remote shell, since our system is set up to disallow rsh. Would this add additional overhead?

With this in mind I concentrated on the SMP setup. I compiled using pgf95 with the following optimisation switches:

-fastsse -Mconcur -Mipa=fast,inline

Running in serial now takes 43s for 50 iterations, whereas compiling with ifort results in a run time of 1m24s, which is a significant speedup.

If I set NCPUS=2 and run again, I can see both CPUs running at 90-100%; however, they both use the same amount of memory (the same as in the serial case) and the time taken is exactly the same. This is despite the fact that -Minfo reports many parallel loops.

I have profiled the code, and 90% of the time is taken up in 3 or 4 subroutines which do the differencing and integration. The outermost loops in these routines have been reported as parallelised, so I'm confused as to why I am seeing no speedup with 2 CPUs. It would appear both CPUs are doing the same work rather than sharing it.

If it helps I could send you the code.


For your SMP code, if you are compiling with -Mconcur and we are parallelizing the outer loops but you aren’t seeing speedup, there may be a few causes.

I see you are running on a Xeon. You should probably make some sort of determination as to whether your code is already memory bound. This can be hard. Is there data reuse? It may help to get an idea of the bytes of data needed per unit of computation. One other question is whether the loops contain “enough” computation to outweigh the cost of going parallel. This can also be hard to answer. If you change the loop bounds drastically up or down (as a meaningless exercise), does the parallel speedup change dramatically as well?

If you want, send your code to trs@pgroup.com, with instructions on how to build and run. Send attention: Brent, and I will look at it as soon as I get a chance.

I have forwarded the code on.