my parallel code broke

Hi All,

I have a set of modeling codes that use common subroutines. There is one code that does a “forward” model (give it a vector of parameters and various model files are computed), and several other codes that implement various optimizing algorithms (genetic algorithm, MCMC sampling, etc.). Many of the optimizing codes have loops that have been successfully parallelized.

The guts of the code involve routines that integrate Newton’s equations of motion for up to 10 bodies. When developing these routines, it was convenient to put the positions of all bodies at all time steps into a large 2D array, and likewise for the velocities. These arrays had dimensions of (N,30), where N was typically 500,000 (10 bodies × x, y, z = 30 columns).

Recently, to improve the speed of the code, I replaced each 2D array with 30 1D arrays (one per coordinate: each body has x, y, and z positions and velocities). This substitution was successful: the forward model ran about 10% faster and produced exactly the same results.

Here is the problem: the optimizing codes also work when run on a single core, but they do not work properly when run in parallel, in spite of the fact that the optimizing codes themselves are unchanged. Each parallel loop has a call to a “monster” routine, which in turn calls the various subroutines that compute a model from the input parameter vector. This monster routine was not altered.

Some possible clues:

  1. The optimizing codes work when run on a single core (they give the same results as the version with the 2D arrays).

  2. The optimizing codes work when compiled with the -mp flag, and when using a single thread only.

  3. If I use 2 threads, the optimizing codes don’t produce the same end result. In my test code, 40 random input parameter vectors are generated and likelihoods are computed for them. The code writes results to several files, and both the “before” and “after” codes report exactly the same 40 random input sets. However, the “after” code produces incorrect likelihoods for half of the sets.

  4. If I use more and more threads, the “after” code still reports the proper 40 random input sets, but the fraction of correct likelihood values goes down.

  5. As noted above, the proper input parameter sets seem to be correctly passed on to the various subroutines that need them.

  6. The “before” codes did run properly in parallel mode: the end result did not depend on the number of threads used, and the output files were identical once they were sorted by index number.

I would appreciate any hints, suggestions, etc. Would the PGI debugger be useful here? If so, what tasks should I invoke, what buttons should I push, etc.?



I forgot to mention that OpenMP is used to make the codes work in parallel.

Hi All,

Some progress to report: I used the -Mchkstk flag to check for stack overflows, and the code reports:

Error: in routine monster there is a
stack overflow: thread 1, max 195313KB, used 0KB, request 656B

I have set the OMP_STACKSIZE environment variable to be about as large as it can get:

setenv OMP_STACKSIZE 300000000

create error: 11
Error: pthread_create: Cannot allocate memory

setenv OMP_STACKSIZE 200000000

Error: in routine monster there is a
stack overflow: thread 1, max 195313KB, used 0KB, request 656B
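One thing I am unsure about is the units. If I am reading the OpenMP spec correctly, a bare OMP_STACKSIZE value is interpreted as kilobytes, which would make 300000000 a request for roughly 286 GB per thread and could explain the pthread_create failure. A suffixed value avoids the ambiguity (csh syntax, matching the setenv commands above; 512M is just an example size):

```shell
setenv OMP_STACKSIZE 512M    # 512 MB per worker thread, unit made explicit
limit stacksize unlimited    # the main thread's stack obeys the shell limit, not OMP_STACKSIZE
```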

By adding several write statements, I have traced execution a few subroutines deep to see how far the code gets before that stack overflow error pops up.

What is the meaning of that error message? The computer has plenty of RAM:

cat /proc/meminfo 
MemTotal:       264540716 kB
MemFree:        116064480 kB
MemAvailable:   262310908 kB

There is a maximum stack size of 195313KB, which seems like a lot less than what the system actually has. (I also notice that 200000000 / 1024 ≈ 195313, so the reported “max” may simply be my OMP_STACKSIZE setting interpreted in bytes, rather than a hard system limit.)

What is strange (to me at least) is that my “before” code also triggers a stack overflow report, but for some reason that does not matter: it consistently produces the correct output (relative to the single-core version) independent of the number of threads. When I use 40 threads, “top” shows about 25% of the memory in use, well under 100%.

I assume the stack size limit is important, otherwise there would not be a compiler flag to see if that limit is exceeded. So my question is why does this limit seem to matter for the most recent version of my code?