Memory Access Error Using OpenMP with OMP_NUM_THREADS=2


I have compiled a legacy CFD code using the pgf77/pgcc compilers.

The code has conditional compilation directives that build either the newly added OpenMP "PARALLEL DO" directives or the legacy SGI "C$DOACROSS" parallel directives (previously coded and validated).

The code runs fine on Unix shared memory computers.
However, I get a memory access error when I run the PGI version on my dual-Xeon IntelliStation running Windows 2000 Pro with OMP_NUM_THREADS=2 and NCPUS=2. Here are the two errors, one from each thread:

The instruction at "0x00587906" referenced memory at "0xfaf9e60c". The memory could not be "written".

The instruction at "0x00587d88" referenced memory at "0xde93a0e4". The memory could not be "written".

The funny thing is that when I run the same version of the code with OMP_NUM_THREADS=1 and NCPUS=1, it runs fine. I don't know whether it's a problem with my code or with the OpenMP compiler.

As a check that I didn't introduce something with the new OpenMP directives, I have tried compiling using the older SGI "C$DOACROSS" syntax that was already coded and has been well tested on various computers over time. Both versions crash with a memory access error when I specify 2 processors, but both run fine if I specify 1. Why does this happen?

Care has been taken to ensure that the necessary loop variables and other variables within the parallelized loops have been specified as "PRIVATE".

Do I need to use a special compiler flag for the Xeon processor? Is there some compiler flag I need for OpenMP codes to prevent memory access errors? Is there a Windows OpenMP debugger? It seems like the PGI debugger is only for Linux, or am I mistaken?

Thanks for any solutions that the forum might be able to offer.

-Steven Speer

This is most likely a stack overflow error. On UNIX I'd have you increase your stack size using the 'unlimit' command, but on Windows you'll need to pass the stack size to the linker. On the link line add "-Wl,--stack=<size>", where <size> is the size of the stack. Try setting the size to a large value like 32000000 or higher.

Do I need to use a special compiler flag for the Xeon processor? Is there some compiler flag I need for OpenMP codes to prevent memory access errors? Is there a Windows OpenMP debugger? It seems like the PGI debugger is only for Linux, or am I mistaken?

The architecture flag for a Xeon is "-tp p7", but this is the default if you are compiling on a Xeon, so it's not necessary to add.

There is not a method for the compiler to prevent all memory access errors; however, the flags "-Mbounds", "-Mchkfpstk", "-Mchkptr", and "-Mchkstk" do help in detecting some types. They are useful for debugging, but I wouldn't use them for production code since you'll pay a severe performance penalty.

Currently, pgdbg is only available for Linux.

Let me know how it goes. Just in case you didn't see, Hongyon was able to find a much better workaround for your "HOSTNM" problem.

- Mat


I tried what you suggested about passing the stack size to the linker, but that still didn't solve the problem. I started at 32 MB, which didn't work, and kept doubling the stack size (64, 128, 256, 512, ...) until there wasn't enough memory left to store the necessary arrays and the dynamic-memory-allocation error-catching routines started to complain. This happened once I got above 500 MB.

By the way, my computer has 2 GB of RAM. The test case I have been trying is a typical medium-sized case with 3.3 million grid points; it requires 1.14 GB of RAM when running on a single processor. This is a medium test case; it is not uncommon for me to run cases nearly double the size. The single-processor version doesn't have a problem with the stack size.

To reiterate, this test case ran just fine using the same PGI-compiled code if I had OMP_NUM_THREADS=1, but crashed with a memory access error if OMP_NUM_THREADS=2.

To determine whether this memory problem is model-size dependent, I ran the same PGI-compiled code on a smaller 300,000-point test case. This smaller case runs fine with either OMP_NUM_THREADS=1 or 2. However, the performance with OMP_NUM_THREADS=2 is terrible: the dual-processor run took five times as long as the same code on 1 processor!! I was expecting the dual-processor version to run at least 1.5-1.75 times faster than the single-processor version, being generous with the additional overhead of splitting the loops up. Here is a breakdown of the time (using the same compiled version of the code):


OMP_NUM_THREADS=2:
TOTAL WALLCLOCK TIME = 0.12 hours ( 416.2 s)
COMPUTATION TIME = 0.11 hours ( 411.8 s; 98.95% )
TOTAL I/O TIME = 0.00 hours ( 2.9 s; 0.69% )
INITIALIZATION TIME = 0.00 hours ( 1.5 s; 0.36% )


OMP_NUM_THREADS=1:
TOTAL WALLCLOCK TIME = 0.02 hours ( 79.5 s)
COMPUTATION TIME = 0.02 hours ( 75.8 s; 95.40% )
TOTAL I/O TIME = 0.00 hours ( 3.4 s; 4.32% )
INITIALIZATION TIME = 0.00 hours ( 0.2 s; 0.28% )

Observing the Windows Task Manager CPU usage monitor during repeated test runs, I noted: 1) the dual-processor run never fully utilized the two processors, using only a roughly constant 4-10% of each, while 2) the single-processor run typically used 50-60%, with one processor doing all the work. This accounts for the time discrepancy, but doesn't explain why the two processors are not better utilized.

Can anyone offer advice on what I may be overlooking that would account for such poor performance? My trial period with the compiler will end soon.

Thanks for all the help. Can anyone recommend a software profiler for Windows?

-Steven Speer

I’m assuming the 1.14 GB used by the program is the amount used for the data, so the total amount of memory used by the program could be higher. The extra overhead (stack and system structures) that accompany the second thread, plus the amount of memory used by the OS and other processes, is probably pushing you over the 2GB limit. Watching the task manager’s performance tab might give you a better idea if this is the problem.

As for the performance regression, I've actually been trying to isolate this issue on Linux for some time. In house, we have two dual-Xeon systems running SuSE 9.0 that showed the same performance regression. We also had customers report this problem with Red Hat 9.0. You're the first I know of to report it on Windows.

Researching the web, I found a few instances of similar problems when using NPTL (Native POSIX Thread Library), the new Linux threading package which first appeared with Red Hat 9.0 and SuSE 9.0. So I set up an experiment where I ran the same executable (NAS BT Class A) on multiple systems, but I only saw the slowdown on the two dual Xeons. I also compiled the program with other Fortran compilers and got the same result. Finally, when we upgraded the Xeons to SuSE 9.1, the same executable got the expected speed-up. While I'm not 100% sure, I find this pretty good evidence that the problem was with the first NPTL release.

Granted, I cannot say that the same holds true for Windows, but I highly suspect it's a similar issue. I'll try some more experiments and see what I can determine.

Our graphical profiler, pgprof, is available for Windows. To use the tool, first compile your program with “-Mprof=[func|lines]” to insert the profile instrumentation. After your run, launch ‘pgprof’ to view the results.

Good Luck,