Issues with omp_get_max_threads

Kaushik · September 13, 2011, 3:33pm

Here is a very simple test code:

program ompMaxThreads
implicit none
integer :: omp_get_max_threads
print *, omp_get_max_threads()
end program

The above code simply prints the maximum number of threads requested by the user. In theory, the user can control this by setting the environment variable OMP_NUM_THREADS, and the code should simply print the value of this variable. With the PGI compiler (versions comp/pgi-11.5.0 and comp/pgi-11.8.0 tested), the system hangs when OMP_NUM_THREADS is set to a value equal to or greater than the number of CPU cores on the node. A user should be able to oversubscribe threads to cores; nothing in the OpenMP standard prevents this.

By the way, the Intel compiler doesn’t have this problem.

MatColgrove · September 13, 2011, 3:49pm

Hi Kaushik,

While we do give a warning when OMP_NUM_THREADS is set greater than the number of cores (which can be disabled by setting the environment variable MP_WARN to NO), the code shouldn’t hang. I tried your code and it worked as expected. What OS and system are you using? Do you have any other OMP environment variables set? Is binding set? Is there any other information that may help me recreate the problem?

Thanks,
Mat

% a.out
Warning: OMP_NUM_THREADS or NCPUS (64) greater than available cpus (32)
           64
Warning: omp_set_num_threads (64) greater than available cpus (32)
% setenv MP_WARN NO
% a.out
           64

Kaushik · September 13, 2011, 4:49pm

Hi Mat,

Here’s some more information. I compiled using the following command:

pgf90 -mp ompMaxThreads.f90

using compiler version 11.5 (although 11.8 seems to have the same issue). I have no OMP environment variables set other than OMP_NUM_THREADS. I ran on a node with 2 hex-core Intel Westmere processors running SUSE Linux (kernel 2.6.27.54-0.2-default).

Hopefully that helps you recreate the problem.

Thanks.

Discover · September 14, 2011, 3:02pm

I am working on the same system as Kaushik, and we believe we may have found a workaround (though it’s somewhat ugly). On our cluster we have the default stacksize set to 4GB rather than SLES11’s default of 8MB:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) 14676329
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127344
max locked memory (kbytes, -l) 8153516
max memory size (kbytes, -m) 13861032
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 4194304
cpu time (seconds, -t) unlimited
max user processes (-u) 127344
virtual memory (kbytes, -v) 19600160
file locks (-x) unlimited

Also note VM limit at ~19GB and max memory size at ~13GB (the node has 24GB of physical memory).

What we see is that with OpenMP turned on and OMP_NUM_THREADS set so that OMP_NUM_THREADS * stacksize is larger than phys mem + swap, the process hangs. If you lower the stacksize so that the threads' stacks do not exhaust physical memory, the code runs fine but uses a ton of virtual memory (15-17GB+).

When the code runs with the stack size still too high (i.e., the total is above phys mem + swap), the process goes into a busy wait, trying to mmap more memory and failing. The process can run at 500%+ CPU just waiting on an mmap that will never complete.
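The arithmetic behind that condition can be sketched in a few lines of shell. This is only a rough model: it assumes every thread reserves the full stack limit, as observed above, and it uses the virtual memory limit from the ulimit output as the budget, which is an assumption standing in for "phys mem + swap". The function names stack_demand_kb and check_stack_budget are hypothetical helpers, not part of any PGI or OpenMP tool.

```shell
#!/bin/sh
# Sketch: compare the stack space that N threads could reserve against a
# memory budget. All sizes are in kB, matching the units of `ulimit -a`.

# Total reservation if every thread gets the full stack size.
# Args: thread count, per-thread stack size in kB.
stack_demand_kb() {
    echo $(( $1 * $2 ))
}

# Prints "exceeds" when the reservation is larger than the budget,
# otherwise "fits". Args: thread count, stack size in kB, budget in kB.
check_stack_budget() {
    demand_kb=$(stack_demand_kb "$1" "$2")
    if [ "$demand_kb" -gt "$3" ]; then
        echo "exceeds"
    else
        echo "fits"
    fi
}

# Values from this thread: 12 threads (one per core), a 4194304 kB stack
# limit, and a 19600160 kB virtual memory limit.
check_stack_budget 12 4194304 19600160
# Same thread count with an 8192 kB (8MB) per-thread stack.
check_stack_budget 12 8192 19600160
```

With the numbers above, the first call prints "exceeds" (12 * 4194304 kB is 50331648 kB, about 48GB) and the second prints "fits".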

One of the other developers at our site had this to say about this problem:

“Hmmm… PGI needs to explain what they are doing. The ‘stacksize’ limit is process level and not thread level. Other OpenMP implementations usually have a separate env variable used to control the stacksize for each thread. For example, Intel, whose OpenMP implementation is based on the former Kuck & Associates libraries and tools, uses the env variable KMP_STACKSIZE to control the stacksize for each thread. The thread stack size, which is private to each thread and used to hold thread-private stack variables, is usually very small compared to the process stacksize. If PGI now uses process level stacksize as the thread stacksize, I think their OpenMP implementation is fundamentally flawed as shown by your tests.”

Are we missing a variable somewhere that we should be setting or is this really a bug?

-Nick

Discover · September 14, 2011, 4:26pm

On a lark, I tried the following, which seems to be another workaround:

export OMP_STACKSIZE=1000000
./pgi_omp_test
Numbers of cpus is 8
Warning: OMP_NUM_THREADS or NCPUS (10) greater than available cpus (8)
Max number of threads is 10
Warning: omp_set_num_threads (10) greater than available cpus (8)

So setting OMP_STACKSIZE seems to override the behavior where PGI takes the per-thread stack size from the user's process-level stack size limit. The user guide mentions this variable:

“Running Parallel Programs on Linux
You may encounter difficulties running auto-parallel or OpenMP programs on Linux systems when the
per-thread stack size is set to the default (2MB). If you have unexplained failures, please try setting the
environment variable OMP_STACKSIZE to a larger value, such as 8MB. For information on setting
environment variables, refer to “Setting Environment Variables,” on page 137.”

There is no mention of what happens when the stack size is too large (but I guess we found out).

I guess at this point, the question would be: is inheriting the user process-level stacksize the expected behavior? We can have users set OMP_STACKSIZE, or set a default that would work better, in addition to our standard user limits. We just need to make sure that we understand what is going on behind the scenes.
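A site-wide default along those lines could be sketched as a shell profile fragment like the one below. It assumes the OpenMP specification's OMP_STACKSIZE syntax, where the value may carry a B, K, M, or G suffix and an unsuffixed number is read as kilobytes; whether PGI 11.x parses the value exactly this way was not confirmed in this thread, so the explicit suffix is there to avoid the ambiguity. The 8M figure is simply the value suggested in the user guide excerpt above, not a tuned recommendation.

```shell
# Keep any value the user has already set; otherwise fall back to a
# per-thread default (assumed value, taken from the user guide's suggestion).
: "${OMP_STACKSIZE:=8M}"
export OMP_STACKSIZE

echo "OMP_STACKSIZE=$OMP_STACKSIZE"
```

Using a fallback rather than a hard assignment lets individual users still override the value for codes that need larger thread stacks.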

MatColgrove · September 14, 2011, 5:46pm

Hi Nick,

“I guess at this point, the question would be, is inheriting the user process level stacksize the expected behavior?”

Yes. It’s the default setting when OMP_STACKSIZE is not set.

“We can have users set OMP_STACKSIZE or set a default that would work better in addition to our standard user limits.”

It’s entirely up to you what works best for your site. Note, though, that OMP_STACKSIZE is the name defined by the OpenMP standard, so it should be supported by other compilers as well as PGI.

Hope this helps,
Mat