[OMPT] ompt_callback_sync_region_wait callback does not follow OpenMP specification

Introduction

The OpenMP specification offers the OMPT interface for tools which enables profiling and tracing of OpenMP constructs in applications. For barriers, two callbacks exist: ompt_callback_sync_region and ompt_callback_sync_region_wait. Both callbacks are optional based on the OpenMP specifications and offer similar functionality. However, there’s an important difference. Looking at an example in the OpenMP specifications (here for implicit barriers), it states [Link, Page 323]:

A thread dispatches a registered ompt_callback_sync_region callback for each implicit barrier begin and end event. Similarly, a thread dispatches a registered ompt_callback_sync_region_wait callback for each implicit barrier wait-begin and wait-end event. All callbacks for implicit barrier events execute in the context of the encountering task and have type signature ompt_callback_sync_region_t.

Notice the difference between begin / end and wait-begin / wait-end. This is the case for all OpenMP constructs dispatching those two callbacks.
The specification is quite vague about OMPT in general, but the events are described in more detail. For implicit barriers, the following events are described:

The implicit-barrier-begin event occurs in each implicit task at the beginning of an implicit barrier region.
The implicit-barrier-wait-begin event occurs when a task begins an interval of active or passive waiting in an implicit barrier region.
The implicit-barrier-wait-end event occurs when a task ends an interval of active or waiting and resumes execution of an implicit barrier region.
The implicit-barrier-end event occurs in each implicit task after the barrier synchronization on exit from an implicit barrier region.

Bug description

When implementing an example code, spawning tasks from a single thread and then computing something (here a sleep call), we would expect to see sync_region_wait callbacks every time the other threads have to wait for new tasks to arrive. The main source code can be seen below:

#include <omp.h>
#include <unistd.h>

int main(void)
{
#pragma omp parallel default(none)
#pragma omp single nowait
    {
        int num_threads = omp_get_num_threads();
        for (int j = 0; j < 4; ++j)
        {
            for (int i = 0; i < num_threads + 1; ++i)
            {
#pragma omp task default(none)
                {
                    usleep(125);
                }
            }
            sleep(3);
        }
    }
}

However, the runtime does not dispatch the sync_region_wait correctly. The threads correctly enter sync_region, but immediately after also enter sync_region_wait with endpoint = ompt_scope_begin and dispatch sync_region_wait with endpoint = ompt_scope_end just before exiting sync_region_wait as well, even though tasks are executed, indicating that the threads are in-fact not waiting.

This can be verified by looking at the output of the reproducer, sorted by thread number:

==============
Thread ID = 1
[thread_begin_cb] tid = 1 | type = initial
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = begin | actual_parallelism = 1 | index = 1 | flags = initial
[parallel_begin_cb] tid = 1 | parallel_data = 7770001 | encountering_task_data = 6660001 | flags = invoker_runtime_team | requested_parallelism = 2 | codeptr_ra = 0x401261
[implicit_task_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | actual_parallelism = 2 | index = 0 | flags = implicit_undeferred
[work_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | work_type = single_executor | count = 1 | codeptr_ra = 0x4012b0
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660004 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660005 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660006 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660007 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660008 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660009 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660010 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660011 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660012 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660013 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660014 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660015 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[work_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = end | work_type = single_executor | count = 1 | codeptr_ra = 0x4012e2
[sync_region_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[sync_region_wait_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[sync_region_wait_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660002 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[sync_region_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660002 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[implicit_task_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660002 | endpoint = end | actual_parallelism = 0 | index = 0 | flags = implicit_undeferred
[parallel_end_cb] tid = 1 | parallel_data = 7770001 | encountering_task_data = 6660001 | flags = invoker_runtime_team | codeptr_ra = 0x401261
[implicit_task_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660001 | endpoint = end | actual_parallelism = 0 | index = 1 | flags = initial
[thread_end_cb] tid = 1
[my_finalize_tool] tid = 1
==============
Thread ID = 2
[thread_begin_cb] tid = 2 | type = worker
[implicit_task_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | actual_parallelism = 2 | index = 1 | flags = implicit_undeferred
[work_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | work_type = single_other | count = 1 | codeptr_ra = 0x4012b0
[work_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = end | work_type = single_other | count = 1 | codeptr_ra = 0x4012b0
[sync_region_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[sync_region_wait_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660004
[task_schedule_cb] tid = 2 | prior_task_data = 6660004 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660006
[task_schedule_cb] tid = 2 | prior_task_data = 6660006 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660005
[task_schedule_cb] tid = 2 | prior_task_data = 6660005 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660007
[task_schedule_cb] tid = 2 | prior_task_data = 6660007 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660009
[task_schedule_cb] tid = 2 | prior_task_data = 6660009 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660008
[task_schedule_cb] tid = 2 | prior_task_data = 6660008 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660010
[task_schedule_cb] tid = 2 | prior_task_data = 6660010 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660012
[task_schedule_cb] tid = 2 | prior_task_data = 6660012 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660011
[task_schedule_cb] tid = 2 | prior_task_data = 6660011 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660013
[task_schedule_cb] tid = 2 | prior_task_data = 6660013 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660015
[task_schedule_cb] tid = 2 | prior_task_data = 6660015 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660014
[task_schedule_cb] tid = 2 | prior_task_data = 6660014 | prior_status = complete | next_task_data = 6660003
[sync_region_wait_cb] tid = 2 | parallel_data = 7777777 | task_data = 6660003 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[sync_region_cb] tid = 2 | parallel_data = 7777777 | task_data = 6660003 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[implicit_task_cb] tid = 2 | parallel_data = 7777777 | task_data = 6660003 | endpoint = end | actual_parallelism = 0 | index = 1 | flags = implicit_undeferred
[thread_end_cb] tid = 2

Even though tid = 2 is inside a sync_region_wait for an implicit barrier, we can see that tasks are executed.

How to reproduce the issue

I was able to reproduce the bug on a system running Ubuntu 22.04 LTS with all NVHPC versions since 22.11 (I don’t have older versions installed on my system)

The issue can be reproduced by downloading the attached archive, extracting it and running the following commands:

$ make clean && make CC=nvc && make run

reproducer.ZIP (6.3 KB)

Thanks for the report Jan. I was able to reproduce the issue and have filed a problem report, TPR#34136.

I’ve asked engineering to investigate.

-Mat

Hi Jan,

Engineering took a look but believes that our implementation is correct in this case.

They noted that the OpenMP 5.2 specification does not say that threads may not engage in task execution while within a ‘sync_region_wait’ region (after ‘ompt_scope_begin’ and before ‘ompt_scope_end’ endpoints). Rather task execution can be viewed as part of “active or passive waiting”.

Second, the intent is for our implementation of OpenMP 5.2 to be similar to LLVM Clang 16.0 OpenMP. Testing shows our implementation near identical to Clang’s OMPT execution trace.

Thanks,
Mat

Good morning Mat,

first of all thanks for the update! You are right, the OpenMP specification is quite sparse when it comes to the OMPT interface in general. The meaning of “active or passive waiting” is not clearly defined, as those words only appear in the context of the environment variable OMP_WAIT_POLITY. Here, it is noted that active waiting is implementation defined. In that sense, task execution could be seen as active waiting.

I suspect that this isn’t the intention of the ompt_callback_sync_region_wait callback, as it doesn’t provide any additional information over the normal ompt_callback_sync_region. If the callback should provide additional information, the specification should be more clear.

Thanks again for looking at the issue.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.