Problem running OpenMP shared object via Java front-end

I am working on parallelizing a scientific simulation package with a rather unusual program setup. It runs in 32-bit mode on a quad Opteron machine with 32GB RAM, running SuSE EL9 with the PGI v6.1 compilers. A Java-based user interface lets the user select various simulation parameters, launch a series of calculations, and then view the results in various graphics formats. For these calculations, Java loads several different shared object libraries as needed; these perform the nitty-gritty simulation number crunching. Each of these shared object libraries is a mix of F77, F90 and C, with a few C++ wrapper routines for the Java interfacing.
I have successfully parallelized these shared object libraries in a standalone version (without the Java and C++ layers).
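For reference, the entry points into these libraries look roughly like the sketch below (names and signatures are invented for illustration, this is not the actual code): a C/C++ routine exported through JNI that eventually calls down into the OpenMP-parallelized Fortran/C number crunching.

#include <jni.h>
#include <omp.h>

/* Hypothetical JNI entry point; the real class and method names differ. */
JNIEXPORT void JNICALL
Java_SimInterface_runCalculation(JNIEnv *env, jobject obj, jint nsteps)
{
    /* The real code calls down into the F77/F90 routines; a simple
       OpenMP loop stands in for the number crunching here. */
    #pragma omp parallel for
    for (int i = 0; i < nsteps; i++) {
        /* ... simulation work for step i ... */
    }
}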

The link statement for each of the shared object libraries can be condensed to this:

pgCC -shared -fPIC -L/usr/pgi/linux86/6.1/lib <long_list_of_-mp_-fPIC_compiled_object_files> -mp -lpgftnrtl -lm -lpgc -lgcc -lc -pgf90libs -o libSimDLL.so

Now when I try to run the calculations via the Java front end, with NCPUS and OMP_NUM_THREADS set to 4, I get the

“Warning: OMP_NUM_THREADS or NCPUS (4) greater than available cpus (1)”

message. The code continues to execute, but only in single-threaded mode. A call within the code to ‘OMP_get_num_procs’ returns ‘1’.
When NCPUS and OMP_NUM_THREADS are set to 1, the same code runs normally (no warning message), and the call to ‘OMP_get_num_procs’ returns ‘4’.
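For the record, the diagnostic inside the library boils down to something like the following (simplified, and written in C here even though the real call is made from the Fortran side):

#include <stdio.h>
#include <omp.h>

/* Prints what the OpenMP runtime thinks it has to work with; called
   from inside the shared object after Java has loaded it. */
void print_omp_info(void)
{
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
}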

I have read elsewhere in this forum that I can avoid the warning by dropping the ‘-lpgc’ from the link statement. When I do so, running with OMP_NUM_THREADS and NCPUS=4 results in a crash, with the message

“**ERROR: in a parallel region there is a stack overflow: thread 0, max 8180KB, used 0KB, request 176B”

displayed a total of three times. Although it does not appear to have actually overflowed the stack (used is 0KB), I changed the stack limit to ‘unlimited’, but this did not prevent the crash. Nor did setting MPSTKZ to 8M.

Is there a recommended ordering of the libraries in the link statement? Could there be something on the Java side that is affecting this?

My current pseudo-workaround is to bootstrap a standalone version to run in parallel mode, but this cuts off the communication with the Java user interface.

Try removing “-lm -lpgc -lgcc -lc” from the link line, since the pgCC driver already adds these. It also adds “-lpgmp”, which needs to come before “-lpgc”.

  • Mat

Thanks for the suggestion Mat.

I managed to reduce the link statement down to this:

pgCC -shared -fPIC -L/usr/pgi/linux86/6.1/lib <long_list_of_-mp_-fPIC_compiled_object_files> -mp -pgf90libs -o libSimDLL.so

This runs fine for NCPUS=1, but produces that same set of (three!) stack overflow messages when NCPUS=4. The call to OMP_get_num_procs returns ‘4’ in both NCPUS cases.

From here, I tried adding ‘-lpgmp’ after the ‘-pgf90libs’. (By the way, is there a difference between using ‘-mp’ and ‘-lpgmp’? Does it matter where the ‘-mp’ appears in the list?) The results are exactly the same as my ‘baseline’, for both the NCPUS=1 and NCPUS=4 cases, and OMP_get_num_procs always returns ‘4’.

Then, I added ‘-lpgc’ after the ‘-lpgmp’. Again, same exact results.

Then I dropped the ‘-lpgmp’ (leaving only ‘-lpgc’ added to the baseline). Here we get different results: for NCPUS=1, it runs fine, but the call to OMP_get_num_procs returns ‘1’ (I was mistaken in my initial post, sorry!). For NCPUS=4, the message

“Warning: OMP_NUM_THREADS or NCPUS (4) greater than available cpus (1)”

is displayed, and it runs in single-threaded mode. The call to OMP_get_num_procs returns ‘1’.

Just to round out the test suite, I added the ‘-lpgmp’ after ‘-lpgc’ to see if that made any difference. Nope. Same funky results as with just ‘-lpgc’.

Any other ideas to try?

Adding “-mp” to the link line simply passes “-lpgmp” to the linker. However, “-lpgmp” must come before “-lpgc”, which explains why only one thread is used in the cases where you have “-lpgc” before “-lpgmp” or where you only added “-lpgc”. To see what is being passed to the linker, either do a “-dryrun” (see below) or add “-v” for verbose output.

Example of using “-dryrun”:

% pgf77 -V6.2-4 -dryrun -mp x.o
... cut rcfile output ..
/usr/bin/ld /usr/lib64/crt1.o /usr/lib64/crti.o /usr/pgi/linux86-64/6.2-4/lib/trace_init.o /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5//crtbegin.o /usr/pgi/linux86-64/6.2-4/lib/initmp.o /usr/pgi/linux86-64/6.2-4/lib/pgfmain.o -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/pgi/linux86-64/6.2-4/lib/pgi.ld -lpgc x.o -L/usr/pgi/linux86-64/6.2-4/lib/mp -L/usr/pgi/linux86-64/6.2-4/lib -L/usr/lib64 -L/usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/ -rpath /usr/pgi/linux86-64/6.2-4/lib/mp -rpath /usr/pgi/linux86-64/6.2-4/lib -lpgmp -lpgbind -lnuma -lpgthread -lpgftnrtl -lnspgc -lpgc -lm -lgcc -lc -lgcc /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5//crtend.o /usr/lib64/crtn.o

OpenMP programs do use a lot more stack space, so stack overflows are the most common error. Typically, setting the stack size to “unlimited” works; however, “unlimited” isn’t really unlimited and the stack can still overflow. In these cases, I manually set the stack size to incrementally larger values until enough stack is available. Try “limit stacksize 128M”, “limit stacksize 512M”, “limit stacksize 1024M”, or even larger values.
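As a rough illustration of why the per-thread stack matters (a made-up example, not your code): every OpenMP thread gets its own stack, so large automatic/local arrays used inside a parallel region are replicated once per thread and can overflow an 8MB thread stack even when the serial run is fine.

#include <omp.h>

void crunch(double *out, int n)
{
    #pragma omp parallel
    {
        /* Each thread gets its own copy of this automatic array on its
           own stack: 2,000,000 doubles is roughly 16MB per thread, which
           overflows a default 8MB thread stack. */
        double work[2000000];
        #pragma omp for
        for (int i = 0; i < n; i++) {
            work[i % 2000000] = (double)i;
            out[i] = work[i % 2000000];
        }
    }
}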

-Mat

Thanks for the tip about using ‘-dryrun’ to see all the gory details. I see that as long as I have the ‘-mp’, I can drop both ‘-lpgmp’ and ‘-lpgc’ from my link command, since they are automatically put in the correct order in the full link.

In tinkering with the stacksize limits, I have encountered some pretty bizarre behavior:
Starting from the default stacksize of 8192 (KB), I kept doubling it until the Java-loaded test routine (with NCPUS=4) stopped crashing with the three(!) stack overflow errors shown in my initial post (‘used’ is always 0KB!). At stacksize=524288 (512MB), it no longer reports a stack overflow, but it appears to be stuck in an infinite loop: the routine hangs and three of the CPUs report 100% load. My diagnostic prints show this occurs just as it reaches the parallel ‘do’ loop.
When NCPUS is changed to 3 and the test case rerun, it reports two(!) stack overflow errors, at the same point in the program. When NCPUS is changed to 2, it reports only one stack overflow error. A variation of this test routine adds a “call omp_set_num_threads(1)” prior to the parallel ‘do’ loop (a stripped-down sketch of the test routine follows below). The behavior is the same, except that the diagnostic prints indicate the stack overflow or CPU hang occurs at this step instead. I would have thought that an increased stack size apparently (somewhat) suitable for four threads would also be suitable for fewer threads. Only when NCPUS=1 did either version of the test routine complete normally.
At stacksize=1048576 (1024MB), when NCPUS=4 or 3, the test routine hangs with only two of the CPUs at 100%. A single stack overflow message is reported for NCPUS=2. NCPUS=1 completes normally.
At stacksize=2097152 (2048MB), when NCPUS=4, 3 or 2, the test routine hangs with only one CPU at 100%. NCPUS=1 completes normally. For this setting, there are no more stack overflows, but nothing gets done either…
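For reference, the Java-loaded test routine boils down to something like this sketch (heavily simplified, with invented names, and written here in C although the real loop is a Fortran parallel ‘do’):

#include <stdio.h>
#include <omp.h>

/* Stripped-down stand-in for the test routine that Java calls. The
   "variation" mentioned above uncomments the omp_set_num_threads(1)
   call before the parallel loop. */
void run_test(double *a, const double *b, int n)
{
    printf("before parallel loop: num_procs = %d\n", omp_get_num_procs());

    /* omp_set_num_threads(1); */   /* the variation overflows/hangs here */

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];          /* stand-in for the real simulation work */

    printf("after parallel loop\n");
}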

The environment variable “MPSTKZ” appears to have no effect at all: during my tests with varying stacksize limits, whether MPSTKZ was undefined, 8M, 2048M, or anywhere in between, the results were exactly the same.

Is there anything about an ‘-mp’-compiled shared object, versus one compiled ‘normally’, that Java would see differently when loading it?
In my efforts to build the standalone parallelized version, I discovered that I would get random crashes when I mixed ‘normal’ and ‘-mp’ object files. These crashes particularly favored ‘normal’ subroutines (threadsafe! and NOT containing any OpenMP directives) that were called from within the parallel region. I finally solved that problem by compiling everything with ‘-mp’. I’m beginning to wonder if these stack overflow errors and hung CPUs are symptoms of a similar incompatibility between Java and ‘-mp’ objects, and may not be easily solved…
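For what it’s worth, my guess about those standalone crashes is that, without ‘-mp’, the compiler is free to put a routine’s local/scratch variables in static storage, so a routine that looks threadsafe at the source level really isn’t once two threads call it at the same time. The rough C analogue would be something like the invented example below (not the actual routines):

/* Invented C analogue of a helper that "looks" threadsafe but keeps its
   scratch space in static storage: all threads share the same buffer,
   so concurrent calls from a parallel region race with each other. */
static double scratch[1024];            /* one copy shared by all threads */

double helper(const double *x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n && i < 1024; i++) {
        scratch[i] = x[i] * x[i];       /* threads overwrite each other here */
        sum += scratch[i];
    }
    return sum;
}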

Any ideas to go on from here?

I’m out of ideas, but I have forwarded your questions to some of our engineers who might have some.

  • Mat