WRFV3.2 multi target parallel make

I am trying to run the WRF V3.2 model in dm (distributed-memory) mode using QLogic InfiniBand interconnects and a series of 1U x86-64 servers.
I am using the latest PGI 10.3 compilers. Most servers are AMD-based (k8-64e), and I have a few Intel-based servers I am trying to add to the cluster, the newest being a dual-socket 5500 series Nehalem-based server.

I am compiling on the new Nehalem server with the flags “-tp k8-64e,nehalem-64” for a parallel make. The model runs fine for a while but usually dies at some point during the run with a glibc memory corruption error. The full error message is listed below. I’ve tried compiling with the “-tp x64” option as well, and the same thing happens. The FCOPTIM flags are:
-fastsse -Mvect=noaltcode -Msmartalloc -Mprefetch=distance:8 -Mfprelaxed

Anyone have any ideas on what may be happening here?

Thanks,
Aaron

wrf.exe:18718 terminated with signal 11 at PC=16293a8 SP=7ffff92aebd0. Backtrace:
*** glibc detected *** ./wrf.exe: malloc(): memory corruption: 0x000000001ed590f0 ***

Hi Aaron,

While I’m not positive that this is the problem, the most common cause of a WRF seg fault is a stack size that is too small. Try setting your stack size to a large value or unlimited in your shell’s configuration file (.bashrc or .cshrc).
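One way to confirm that the stack setting is really in effect is to print the limits from inside the environment the job runs in, since a batch or remote shell may not read the same startup files as a login shell. The following is only a minimal sketch using Python's standard `resource` module on a Unix system; the helper names `format_limit` and `stack_limits` are made up for illustration.

```python
import resource


def format_limit(value):
    """Render a limit in kB, or 'unlimited' when no limit is set."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def stack_limits():
    """Return the (soft, hard) stack limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}, hard limit: {hard}")
```

Running this from the same script that launches the model, on each node, shows whether every node actually sees the value you expect.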

Hope this helps,
Mat

I had already tried that one: the stack size is set to unlimited both in my environment and in the script I use to run the model. Another tidbit: not all of the model tasks terminate on every machine, and I’ll often have to manually kill a few processes even though the model has effectively stopped executing. This error is different from a normal “seg fault”.

Hi Aaron,

I’m not sure then. The next thing I’d do would be to try a smaller workload and fewer MPI processes. I’d even go down to a single node running only a few processes. If it works, then you’re most likely hitting some resource limit (even with ‘unlimited’ set, the stack size has a system-dependent hard limit) or there is a problem with a particular node. If you’re using the hybrid MPI/OpenMP version of WRF, you can also try increasing your OMP_STACKSIZE.
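For a single-node test like this, a small launcher can raise the soft stack limit as far as the hard limit allows and set OMP_STACKSIZE for the child processes. This is a rough sketch only, assuming a Unix system and Python's standard `resource` module. The function names and the "512M" default are illustrative assumptions, not recommended values, and the change applies only to this process and to processes started from it on the same node.

```python
import os
import resource


def raise_soft_stack_limit():
    """Try to raise the soft stack limit to the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
        except (ValueError, OSError):
            # Some systems refuse the change; keep the current limits.
            pass
    return resource.getrlimit(resource.RLIMIT_STACK)


def single_node_environment(omp_stacksize="512M"):
    """Copy the current environment and set OMP_STACKSIZE in the copy.

    The default value is an arbitrary example.
    """
    env = dict(os.environ)
    env["OMP_STACKSIZE"] = omp_stacksize
    return env


if __name__ == "__main__":
    soft, hard = raise_soft_stack_limit()
    env = single_node_environment()
    print(f"soft={soft} hard={hard} OMP_STACKSIZE={env['OMP_STACKSIZE']}")
```

The returned environment can be passed as `env=` to `subprocess.run` together with whatever launch command you normally use. Processes started on other nodes take their limits from those nodes, so this does not replace checking each machine.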

My next step would then be to run the application in a debugger. The PGI debugger, PGDBG, would work but you need a CDK license for cluster debugging. A Workstation license still allows you to debug MPI, but only on a single node.

- Mat