I am working on the same system as Kaushik, and we believe we may have found a workaround (though it’s somewhat ugly). On our cluster the default stacksize is set to 4GB rather than SLES11’s default of 8MB, as the output of ulimit -a shows:
core file size (blocks, -c) 0
data seg size (kbytes, -d) 14676329
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127344
max locked memory (kbytes, -l) 8153516
max memory size (kbytes, -m) 13861032
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 4194304
cpu time (seconds, -t) unlimited
max user processes (-u) 127344
virtual memory (kbytes, -v) 19600160
file locks (-x) unlimited
Also note the VM limit at ~19GB and the max memory size at ~13GB (the node has 24GB of physical memory).
What we see is that with OpenMP turned on and OMP_NUM_THREADS set so that OMP_NUM_THREADS * stacksize is larger than physical memory + swap, the process hangs. If you lower the stacksize so that OMP_NUM_THREADS * stacksize does not exhaust physical memory, the code runs fine but uses a ton of virtual memory (15-17GB+).
When the code runs with the stacksize still too high (i.e., above physical memory + swap), the process goes into a busy wait, trying to mmap more memory and failing. The process can run at 500%+ CPU just waiting on an mmap that will never complete.
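To make the arithmetic concrete, here is a rough sketch of the check we are describing. The stack size and the 24GB of physical memory come from the numbers above; the swap size is not something we have listed, so SWAP_KB below is a placeholder assumption, and the function names are made up for illustration.

```python
# Rough check: does OMP_NUM_THREADS * stacksize exceed phys mem + swap?
STACK_KB = 4194304               # "stack size (kbytes, -s)" from the ulimit output
PHYS_MEM_KB = 24 * 1024 * 1024   # node has 24GB of physical memory
SWAP_KB = 0                      # ASSUMPTION: swap size is not given in this post


def total_stack_kb(num_threads, stack_kb=STACK_KB):
    """Total stack reservation if every thread gets the full stacksize."""
    return num_threads * stack_kb


def max_threads(budget_kb, stack_kb=STACK_KB):
    """Largest thread count whose combined stacks still fit in budget_kb."""
    return budget_kb // stack_kb


if __name__ == "__main__":
    budget_kb = PHYS_MEM_KB + SWAP_KB
    for n in (2, 4, 6, 8):
        total = total_stack_kb(n)
        verdict = "exceeds" if total > budget_kb else "fits within"
        print(f"{n} threads: {total} KB of stack {verdict} {budget_kb} KB")
    print("max threads under this budget:", max_threads(budget_kb))
```

This only models the stack reservations; it ignores the rest of the program's memory and the ~19GB VM limit, so the real threshold will be lower.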
One of the other developers at our site had this to say about this problem:
“Hmmm… PGI needs to explain what they are doing. The ‘stacksize’ limit is process level and not thread level. Other OpenMP implementations usually have a separate env variable used to control the stacksize for each thread. For example, Intel, whose OpenMP implementation is based on the former Kuck & Associates libraries and tools, uses the env variable KMP_STACKSIZE to control the stacksize for each thread. The thread stack size, which is private to each thread and used to hold thread-private stack variables, is usually very small compared to the process stacksize. If PGI now uses process level stacksize as the thread stacksize, I think their OpenMP implementation is fundamentally flawed as shown by your tests.”
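For reference, this is roughly what our "lower the stacksize" workaround looks like as a launcher script. It is only a sketch: the helper names are made up, and OMP_STACKSIZE is the per-thread variable defined by the OpenMP 3.0 spec, which we have not confirmed the PGI runtime honors. The part that actually made a difference in our tests was lowering the process stack limit before starting the program.

```python
import os
import resource
import subprocess


def openmp_env(num_threads, thread_stack="64M", base_env=None):
    """Build an environment for the child process.

    OMP_STACKSIZE is the OpenMP 3.0 per-thread stack variable; whether a
    given runtime honors it is an ASSUMPTION here, not something we verified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["OMP_STACKSIZE"] = thread_stack
    return env


def run_with_stack_limit(cmd, stack_kb, env=None, **kwargs):
    """Run cmd with the soft stack limit lowered (like `ulimit -s stack_kb`)."""

    def lower_stack_limit():
        # Runs in the child just before exec; only the soft limit is changed.
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        soft = stack_kb * 1024
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)
        resource.setrlimit(resource.RLIMIT_STACK, (soft, hard))

    return subprocess.run(cmd, env=env, preexec_fn=lower_stack_limit, **kwargs)


if __name__ == "__main__":
    # "./our_openmp_app" is a placeholder for the real executable.
    result = run_with_stack_limit(["./our_openmp_app"], stack_kb=65536,
                                  env=openmp_env(8))
    print("exit code:", result.returncode)
```

Doing this in a wrapper keeps the 4GB cluster default in place for everything else and only shrinks the limit for the OpenMP job.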
Are we missing a variable somewhere that we should be setting or is this really a bug?