WRFV3.2 multi target parallel make

apsims · April 15, 2010, 7:44pm

I am trying to run the WRFV3.2 model in dm mode using QLogic InfiniBand interconnects and a series of 1U 64-bit x86 servers.
I am using the latest PGI 10.3 compilers. Most servers are AMD-based (k8-64e), and I have a few Intel-based servers I am trying to add to the cluster, the newest one being a dual-socket 5500-series Nehalem server.

I am compiling on the new Nehalem server and am using flags of “-tp k8-64e,nehalem-64” for a parallel make. The model runs fine for a while but usually dies at some point in the run with a glibc memory corruption error. The full error message is listed below. I've tried compiling with the “-tp x64” option as well, and the same thing happens. The FCOPTIM flags are:
-fastsse -Mvect=noaltcode -Msmartalloc -Mprefetch=distance:8 -Mfprelaxed

Does anyone have any ideas on what may be happening here?

Thanks,
Aaron

wrf.exe:18718 terminated with signal 11 at PC=16293a8 SP=7ffff92aebd0. Backtrace:
*** glibc detected *** ./wrf.exe: malloc(): memory corruption: 0x000000001ed590f0 ***

MatColgrove · April 15, 2010, 8:05pm

Hi Aaron,

While I’m not positive that this is the problem, the most common cause of a WRF seg fault is a stack size that is too small. Try setting your stack size to a large value or to unlimited in your shell’s configuration file (.bashrc or .cshrc).

Hope this helps,
Mat

apsims · April 16, 2010, 12:41am

I had already tried that one; the stack size is set to unlimited both in my environment and in the script I am running. Another tidbit: the model's executable tasks do not necessarily all terminate on every machine, and I'll often have to manually kill a few processes, even though the model has effectively stopped executing. This error is different from a normal “seg fault”.

MatColgrove · April 16, 2010, 6:33pm

Hi Aaron,

I’m not sure then. The next thing I’d do would be to try a smaller workload and fewer MPI processes. I’d even go down to a single node running only a few processes. If it works, then you’re most likely hitting some resource limit (even with ‘unlimited’ set, the stack size has a system-dependent hard limit) or there is a problem with a particular node. If you’re using the hybrid MPI/OpenMP version of WRF, you can also try increasing your OMP_STACKSIZE.
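To see which stack limits a process actually gets on a given node, a small check along these lines may help. It is only a sketch in Python using the standard `resource` module, not part of WRF or the PGI tools, and the names `describe_limit` and `stack_report` are made up for illustration.

```python
import os
import resource
import socket


def describe_limit(value):
    """Render an rlimit value in kB, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def stack_report():
    """Return the soft and hard stack limits seen by this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return {
        "soft": describe_limit(soft),
        "hard": describe_limit(hard),
        # Only relevant for a hybrid MPI/OpenMP build of WRF.
        "OMP_STACKSIZE": os.environ.get("OMP_STACKSIZE", "(not set)"),
    }


if __name__ == "__main__":
    # Print the host name so output from different nodes can be told apart.
    print(socket.gethostname(), stack_report())
```

Starting it the same way wrf.exe is started (same job script, same launcher) on each node should show whether the processes inherit the limit you set in your shell configuration, or whether some node reports a smaller soft or hard value.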

My next step would then be to run the application in a debugger. The PGI debugger, PGDBG, would work, but you need a CDK license for cluster debugging. A Workstation license still allows you to debug MPI, but only on a single node.

Mat
