I have been working with the PGI 6.0 compilers for about a month now whilst porting some in-house simulation codes. We are finding that the performance scaling of PGI OpenMP on dual and quad Opteron systems is pretty poor.
We have three simulation codes, all developed on large SMP boxes (mostly SGI), where they scale pretty much linearly up to somewhere between 8 and 32 processors. When we build and run on the Opterons, however, we only get a speedup of 1.2-1.3 on dual-processor systems and 2.1-2.3 on quad boxes.
Am I missing some optimisation options here? There may be an issue with memory locality, which could be either a kernel problem or internal to the compiler. It appears that the codes are bottlenecking on memory transfers.
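To make the pattern concrete, here is a stripped-down sketch (not one of our actual codes; the array names, size and triad kernel are just placeholders) of the kind of memory-bound OpenMP loop we rely on, with a first-touch initialisation that should keep each thread's pages on its local node on a NUMA Opteron box:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#define N 20000000   /* placeholder size, big enough to defeat the caches */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    long i;
    if (!a || !b || !c) return 1;

    /* First-touch initialisation: each thread touches the pages it will
       later work on, so (with a NUMA-aware kernel) they should be placed
       in that thread's local memory rather than all on node 0. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    double t0 = omp_get_wtime();

    /* Memory-bound kernel: a STREAM-triad-like update, the sort of loop
       that dominates our simulation codes. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];

    double t1 = omp_get_wtime();
    printf("triad: %.3f s, %.1f MB/s\n",
           t1 - t0, 3.0 * N * sizeof(double) / (t1 - t0) / 1e6);

    free(a); free(b); free(c);
    return 0;
}

Built with something like "pgcc -mp -fast". If a toy loop like this also fails to scale across the Opteron nodes, that would point at page placement / memory bandwidth rather than anything specific to our codes.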
I'm tempted to think this is a compiler/OpenMP implementation problem, since we are finding pretty much the same thing with f77, f90, C and C++, yet a code like HPL Linpack scales pretty well when MPI handles the process creation rather than OpenMP handling the threads.
Has anyone had similar issues with their own code?