unexpected omp for behavior

I was testing some code earlier today that used OpenMP directives to create a set of threads, then had one thread start an accelerator region while the rest went on to process another set of data in a regular OpenMP for loop with the static schedule. Some odd behavior cropped up in the process. A minimal example is this snippet of code.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>
#include <accel.h>

#define SIZE 1000
int main(int argc, char * argv[])
{
    int limit = omp_get_thread_limit();
    printf("limit:%d\n", limit);
    int stuff[SIZE]={0};
    int i;
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("thread_id:%d\n", tid);
        if(tid > 0)
        {
            int min=SIZE+1;
            int max=-1;
            printf("thread_id:%d inside\n", tid);
#pragma omp for
            for(i = 0; i<SIZE; i++)
            {
                if(i>max) max = i;
                if(i<min) min = i;
                stuff[i] = 1;
            }
            printf("thread_id:%d min=%d, max=%d\n", tid, min, max);
        }else{
            //do something else
        }
    }
    for(i = 0; i<SIZE; i++)
    {
        if(stuff[i] != 1){
            printf("fail after: %d\n", i);
            exit(1);
        }
    }
    return 0;
}

The “do something else” branch runs just fine, but the program hangs indefinitely with the remaining nthreads-1 threads waiting on a barrier. It seems the runtime expects every thread in the enclosing parallel region to enter the omp for loop, and it fails rather badly when some don’t. That being the first problem, I tried this.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>
#include <accel.h>

#define SIZE 1000
int main(int argc, char * argv[])
{
    int limit = omp_get_thread_limit();
    printf("limit:%d\n", limit);
    int stuff[SIZE]={0};
    int i;
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("thread_id:%d\n", tid);
        if(tid > 0)
        {
            int min=SIZE+1;
            int max=-1;
            printf("thread_id:%d inside\n", tid);
#pragma omp for nowait
            for(i = 0; i<SIZE; i++)
            {
                if(i>max) max = i;
                if(i<min) min = i;
                stuff[i] = 1;
            }
            printf("thread_id:%d min=%d, max=%d\n", tid, min, max);
        }else{
            //do something else
        }
    }
    for(i = 0; i<SIZE; i++)
    {
        if(stuff[i] != 1){
            printf("fail after: %d\n", i);
            exit(1);
        }
    }
    return 0;
}

Just adding “nowait” to the loop lets the parallel region complete, relying on the barrier of the enclosing region instead. That said, it does not complete correctly; this is the output.

limit:64
thread_id:0
thread_id:1
thread_id:1 inside
thread_id:5
thread_id:5 inside
thread_id:5 min=625, max=749
thread_id:2
thread_id:2 inside
thread_id:2 min=250, max=374
thread_id:7
thread_id:7 inside
thread_id:7 min=875, max=999
thread_id:4
thread_id:4 inside
thread_id:1 min=125, max=249
thread_id:3
thread_id:3 inside
thread_id:3 min=375, max=499
thread_id:4 min=500, max=624
thread_id:6
thread_id:6 inside
thread_id:6 min=750, max=874
fail after: 0

Note that the range acted on by the code is not 0-999 as one would expect, but actually 125-999. This seems to imply that when one of the threads doesn’t show up, its share of the work is simply not done. Changing the schedule to dynamic fixes the second problem, but if I want to use the static schedule the system has to be tricked into using the correct range, and both cases require nowait to avoid blocking forever. Is this behavior intentional? If not, is there a plan to fix it for 12.0?

Hi njustn,

This is the expected and correct behavior.

Upon entry to the parallel region your threads are created (let’s say 4 threads). So when the omp for construct is entered, 4 threads are expected and the problem is broken up four ways. Since you have one thread skip the construct, the other 3 threads sit at a barrier waiting for the 4th.

To fix it, you’ll need to remove the “#pragma omp for” and manually divide the work of the inner for loop among the threads.

Hope this helps,
Mat