I have a set of modeling codes that use common subroutines. There is one code that does a “forward” model (give it a vector of parameters and various model files are computed), and several other codes that implement various optimizing algorithms (genetic algorithm, MCMC sampling, etc.). Many of the optimizing codes have loops that have been successfully parallelized.
The guts of the code are routines that integrate Newton’s equations of motion for up to 10 bodies. When developing these routines, it was convenient to store the positions of all bodies at all time steps in one large 2D array, and likewise for the velocities. These arrays had dimensions of (N,30), i.e. 3 coordinates for each of up to 10 bodies, where N was typically 500,000.
Recently, to improve the speed of the code, I replaced each 2D array with 30 1D arrays, one per coordinate per body (each body has x, y, and z positions and velocities). The substitution was successful: the forward model ran about 10% faster and produced exactly the same results.
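For concreteness, the two storage schemes look roughly like this. This is a minimal C sketch (the actual code’s language isn’t shown here), with toy sizes and hypothetical names:

```c
/* Toy sizes; the real code has ~500,000 steps and up to 10 bodies
 * (30 coordinate arrays). All names here are hypothetical. */
enum { NSTEP = 8 };

/* "Before": all positions in one 2D array, pos2d[step][3*body + axis]. */
static double pos2d[NSTEP][6];

/* "After": one 1D array per coordinate per body (two bodies shown). */
static double px0[NSTEP], py0[NSTEP], pz0[NSTEP];
static double px1[NSTEP], py1[NSTEP], pz1[NSTEP];

/* Fill both layouts from the same toy trajectory. */
void fill(void) {
    for (int t = 0; t < NSTEP; ++t) {
        pos2d[t][0] = px0[t] = 1.0 * t;   /* body 0: x, y, z */
        pos2d[t][1] = py0[t] = 2.0 * t;
        pos2d[t][2] = pz0[t] = 3.0 * t;
        pos2d[t][3] = px1[t] = 4.0 * t;   /* body 1: x, y, z */
        pos2d[t][4] = py1[t] = 5.0 * t;
        pos2d[t][5] = pz1[t] = 6.0 * t;
    }
}

/* Returns 1 if the two layouts hold identical trajectories. */
int layouts_agree(void) {
    for (int t = 0; t < NSTEP; ++t)
        if (pos2d[t][0] != px0[t] || pos2d[t][1] != py0[t] ||
            pos2d[t][2] != pz0[t] || pos2d[t][3] != px1[t] ||
            pos2d[t][4] != py1[t] || pos2d[t][5] != pz1[t])
            return 0;
    return 1;
}
```

Both layouts hold the same numbers; only the memory access pattern changes, which is presumably where the ~10% speedup came from.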
Here is the problem: the optimizing codes also seem to work when run on a single core, but they don’t work properly in parallel, even though the optimizing codes themselves are unchanged. Each parallel loop has a call to a “monster” routine, which in turn calls the various subroutines that compute a model from the input parameter vector. The monster routine was not altered.
Some possible clues:
1. The optimizing codes work when run on a single core (they give the same results as the version with the 2D arrays).
2. The optimizing codes also work when compiled with the -mp flag, as long as only a single thread is used.
3. With 2 threads, the optimizing codes no longer produce the same end result. In my test code, 40 random input parameter vectors are generated and a likelihood is computed for each. The code writes various results to various files, and both the “before” and “after” codes report exactly the same 40 random input sets. However, the “after” code produces incorrect likelihoods for half of the sets.
4. As I add more threads, the “after” code still reports the proper 40 random input sets, but the fraction of correct likelihood values goes down.
5. As noted above, the input parameter sets do seem to be passed correctly to the subroutines that need them.
6. The “before” codes did run properly in parallel: the end result did not depend on the number of threads used, and the output files were identical once sorted by index number.
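These symptoms look like a data race on storage shared between threads. One hedged guess: if the refactor moved the new 1D arrays to global/module scope, they are shared by default under OpenMP, so concurrent calls to the monster routine would overwrite each other’s trajectories while the per-iteration input parameter sets stay correct. A minimal C sketch of that failure mode, with hypothetical names:

```c
enum { NSTEP = 64 };   /* toy size */

/* If the refactor moved the 30 new 1D arrays to global/module scope,
 * storage like this is SHARED by all OpenMP threads by default: */
static double xpos[NSTEP];

/* Stand-in for the "monster" routine: fills a trajectory for one
 * parameter vector, then reduces it to a likelihood-like scalar. */
double monster(double param) {
    double sum = 0.0;
    for (int t = 0; t < NSTEP; ++t)
        xpos[t] = param * t;     /* another thread may overwrite this... */
    for (int t = 0; t < NSTEP; ++t)
        sum += xpos[t];          /* ...before it is read back here */
    return sum;
}

/* Structure of each optimizing code's parallel loop: the loop itself is
 * unchanged; only what monster() touches internally has changed. */
void evaluate_all(const double *params, double *like, int n) {
    #pragma omp parallel for     /* ignored when built without -mp/-fopenmp */
    for (int i = 0; i < n; ++i)
        like[i] = monster(params[i]);
}
```

Built serially, or with one thread, this gives correct answers; with several threads, the shared xpos[] makes some likelihoods wrong while the input sets stay correct, matching the symptoms described above. If this is the cause, making such work arrays threadprivate, or passing them into the routines as local/automatic arrays, should restore correctness.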
I would appreciate any hints or suggestions. Would the PGI debugger be useful here? If so, which tasks should I invoke, which buttons should I push, etc.?