Low performance on similar machine

Hi,
We are using MPICH and PGI for our model. The version compiled on a dual quad-core Intel machine utilizes 100% CPU on the machine where it was compiled (using 8 processes), but when I move it to a new machine (also quad-core, just a newer series), it uses only 50% of the CPU (again with 8 processes).
Both machines run Fedora 8.
slow CPUs: Intel® Xeon® CPU E5440 @ 2.83GHz
fast CPUs: Intel® Xeon® CPU X5450 @ 3.00GHz
kernel: Linux version 2.6.23.1-42.fc8
MPICH was compiled as advised on the PGI website.

Any suggestion is welcome, as I have no idea where to look for a solution.

Vladimir

Hi Vladimir,

I’m a little unclear as to the problem. Am I correct that you are running an MPICH application on one machine using “mpirun -np 8” and seeing 100% utilization on each of the 8 cores? On the second system, when you run the exact same application using “mpirun -np 8”, do you see 100% utilization on 4 cores or 50% utilization on all 8 cores?

If you are seeing only 4 cores utilized, I would look at your machines file and see if it is configured correctly.
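For example, a machines file that explicitly places all 8 processes on one host would look roughly like the sketch below (assuming the usual MPICH1/ch_p4 machines-file syntax; “yourhostname” and “your_model” are just placeholders for your own hostname and executable):

# machines file, e.g. the one passed to mpirun via -machinefile
yourhostname:8

# launch, pointing mpirun at that file
mpirun -np 8 -machinefile machines ./your_model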

  • Mat

Hi,
We have two servers, each containing two quad-core CPUs. We run the application with mpirun -np 8 on both servers. On one of them the application uses all 8 cores at 100% load; on the other, utilization on all 8 cores hovers around 50% and almost never goes over 60%, except in the first few seconds right after start.
On both servers the machines file contains only the name of that server, without the number of CPUs. The PGI version on one (the fast one) is 7.1-2; on the other we tried copying the binary from the “fast” one, and today we used a trial license (7.2) to compile the model, both with the same result (8 x 50%).
Fast server
part of /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5450  @ 3.00GHz
stepping        : 6
cpu MHz         : 3000.106
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips        : 6003.43
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:



uname -a
Linux earth 2.6.23.1-42.fc8 #1 SMP Tue Oct 30 13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Slow server

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
stepping        : 6
cpu MHz         : 2833.442
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips        : 5670.05
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:



uname -a
Linux mercury 2.6.25.4-10.fc8 #1 SMP Thu May 22 22:58:37 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

The slow one has 8 GB of RAM, and the fast one 4 GB. We are running a meteorological model. If you require more info, just ask.
Vladimir

Hi Vladimir,

If your total utilization was 50%, then I would say that the 8 processes were being bound to only 4 cores and you should update your machines file to use “hostname:8”. However, it sounds like you’ve used top (pressing ‘1’ to view all the cores) and verified that all 8 cores are active but each is only at about 50% utilization. Honestly, this doesn’t make a lot of sense, and I’m not sure why it would happen, especially since you’re using the exact same binary and have more memory on the ‘slow’ system.

What happens if you reduce the number of processes (-np 1, -np 2, -np 4)? Is your application I/O intensive, and is your data on an NFS mount?
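Something along these lines would show whether the slowdown scales with the process count and whether the data really is local (illustrative commands only; substitute your own launch line and paths):

# time the same run at different process counts
time mpirun -np 2 -machinefile machines ./your_model
time mpirun -np 4 -machinefile machines ./your_model
time mpirun -np 8 -machinefile machines ./your_model

# check whether the data directory sits on NFS or a local filesystem
mount | grep nfs
df -T /path/to/your/data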

  • Mat

This is the output of top with ‘1’ pressed. The application in question is the ETA meteorological model. As far as I can tell, the only difference between the servers is the kernel revision, and that is only a minor one. So I am hoping it is some kind of inconsistency between the kernel, MPICH and PGI. The fact that it is an identical binary leads me towards this conclusion, although it’s a rather weak explanation.
The data is entirely on a local hard drive, no NFS. As for reducing the number of processes, I get an error from the scheduling algorithm if I do, so I can only run it with -np 8. Not a lot of options here :).

top - 21:09:27 up 9 days,  6:28,  2 users,  load average: 1.73, 0.44, 0.14
Tasks: 198 total,   7 running, 191 sleeping,   0 stopped,   0 zombie
Cpu0  : 37.5%us,  5.1%sy,  0.0%ni, 56.8%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu1  : 38.4%us,  9.3%sy,  0.0%ni, 51.5%id,  0.0%wa,  0.0%hi,  0.8%si,  0.0%st
Cpu2  : 45.1%us,  4.4%sy,  0.0%ni, 50.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu3  : 35.1%us,  2.6%sy,  0.0%ni, 62.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  : 32.3%us,  6.0%sy,  0.0%ni, 61.5%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu5  : 35.1%us,  6.2%sy,  0.0%ni, 58.2%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu6  : 30.1%us,  1.0%sy,  0.0%ni, 68.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  : 53.7%us,  8.4%sy,  0.0%ni, 37.1%id,  0.0%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:   8197708k total,  6616944k used,  1580764k free,   169544k buffers
Swap:  2031608k total,       48k used,  2031560k free,  5021668k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9536 mitopr    20   0  266m 141m 3448 S   52  1.8   0:11.87 etafcst_all.x
 9548 mitopr    20   0  266m 140m 3416 S   50  1.8   0:11.01 etafcst_all.x
 9524 mitopr    20   0  266m 141m 3448 R   47  1.8   0:10.58 etafcst_all.x
 9488 mitopr    20   0  266m 141m 3424 R   46  1.8   0:11.25 etafcst_all.x
 9476 mitopr    20   0  266m 141m 3452 R   45  1.8   0:10.23 etafcst_all.x
 9500 mitopr    20   0  266m 140m 3392 S   43  1.8   0:09.81 etafcst_all.x
 9512 mitopr    20   0  266m 140m 3408 R   36  1.8   0:08.62 etafcst_all.x
 9464 mitopr    20   0  266m 141m 3472 R   35  1.8   0:08.44 etafcst_all.x
 1565 root      39  19     0    0    0 S    0  0.0  32:19.40 kipmi0

Vladimir

Hi Vladimir,

Given that it’s the same binary, I doubt it has anything to do with the compiled code. Rather, I’d focus on the MPICH and system configuration. Which MPICH are you using? Try using the exact same install (if possible) and configuration to see if the problem lies there.
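A quick way to compare the two installs is to run something like the following on each server (assuming the MPICH bin directory is first in your PATH; mpichversion is only there if your build installed it):

which mpirun        # confirm which MPICH install is being picked up
mpicc -show         # show the underlying compiler and link flags the wrappers use
mpichversion        # print the MPICH version and configuration, if available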

If you still can’t figure it out, you may need to profile the processes to determine why they are blocked. Which PGI product do you have? The PGI CDK product includes MPI profiling, which may be helpful here.
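If you don’t have the CDK profiler, a generic stopgap (not a PGI tool, just an assumption on my part that the ranks may be waiting in system calls) is to look at the running ranks directly:

# list the ranks, which core each is on, and its CPU%
ps -C etafcst_all.x -o pid,psr,pcpu,comm

# attach to one rank for a while, then Ctrl-C to get a per-syscall time summary
strace -c -p <pid_of_one_rank>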

  • Mat

Hi,
It seems it was a kernel issue. I changed to the same kernel I have on the fast machine, and it is now working at full speed.

 9640 mitopr    20   0  266m 141m 3792 R   97  1.8   0:31.37 etafcst_all.x
 9628 mitopr    20   0  266m 141m 3792 R   96  1.8   0:30.96 etafcst_all.x
 9592 mitopr    20   0  266m 141m 3672 R   96  1.8   0:31.15 etafcst_all.x
 9580 mitopr    20   0  266m 141m 3708 R   95  1.8   0:30.61 etafcst_all.x
 9604 mitopr    20   0  266m 140m 4008 R   92  1.8   0:29.05 etafcst_all.x
 9652 mitopr    20   0  266m 140m 4112 R   91  1.8   0:29.25 etafcst_all.x
 9616 mitopr    20   0  266m 140m 3756 R   90  1.8   0:28.54 etafcst_all.x
 9568 mitopr    20   0  266m 141m 3724 R   88  1.8   0:28.14 etafcst_all.x

But I have no idea what caused the difference, since it was only a minor kernel revision: 2.6.23.1-42.fc8 vs 2.6.25.4-10.fc8. It is beyond my knowledge, so I’ll leave it the way it is.
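For reference, checking the running kernel and the installed kernel packages on Fedora is just:

uname -r        # kernel the machine is currently running
rpm -q kernel   # kernel packages installed on the system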
Thank you for your help

Vladimir