Introduction
The OpenMP specification offers the OMPT interface for tools which enables profiling and tracing of OpenMP constructs in applications. For barriers, two callbacks exist: ompt_callback_sync_region
and ompt_callback_sync_region_wait
. Both callbacks are optional based on the OpenMP specifications and offer similar functionality. However, there’s an important difference. Looking at an example in the OpenMP specifications (here for implicit barriers), it states [Link, Page 323]:
A thread dispatches a registered
ompt_callback_sync_region
callback for each implicit barrier begin and end event. Similarly, a thread dispatches a registeredompt_callback_sync_region_wait
callback for each implicit barrier wait-begin and wait-end event. All callbacks for implicit barrier events execute in the context of the encountering task and have type signatureompt_callback_sync_region_t
.
Notice the difference between begin / end and wait-begin / wait-end. This is the case for all OpenMP constructs dispatching those two callbacks.
The specification is quite vague about OMPT in general, but the events are described in more detail. For implicit barriers, the following events are described:
The implicit-barrier-begin event occurs in each implicit task at the beginning of an implicit barrier region.
The implicit-barrier-wait-begin event occurs when a task begins an interval of active or passive waiting in an implicit barrier region.
The implicit-barrier-wait-end event occurs when a task ends an interval of active or waiting and resumes execution of an implicit barrier region.
The implicit-barrier-end event occurs in each implicit task after the barrier synchronization on exit from an implicit barrier region.
Bug description
When implementing an example code, spawning tasks from a single thread and then computing something (here a sleep
call), we would expect to see sync_region_wait
callbacks every time the other threads have to wait for new tasks to arrive. The main source code can be seen below:
#include <omp.h>
#include <unistd.h>
int main(void)
{
#pragma omp parallel default(none)
#pragma omp single nowait
{
int num_threads = omp_get_num_threads();
for (int j = 0; j < 4; ++j)
{
for (int i = 0; i < num_threads + 1; ++i)
{
#pragma omp task default(none)
{
usleep(125);
}
}
sleep(3);
}
}
}
However, the runtime does not dispatch the sync_region_wait
correctly. The threads correctly enter sync_region
, but immediately after also enter sync_region_wait
with endpoint = ompt_scope_begin
and dispatch sync_region_wait
with endpoint = ompt_scope_end
just before exiting sync_region_wait
as well, even though tasks are executed, indicating that the threads are in-fact not waiting.
This can be verified by looking at the output of the reproducer, sorted by thread number:
==============
Thread ID = 1
[thread_begin_cb] tid = 1 | type = initial
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = begin | actual_parallelism = 1 | index = 1 | flags = initial
[parallel_begin_cb] tid = 1 | parallel_data = 7770001 | encountering_task_data = 6660001 | flags = invoker_runtime_team | requested_parallelism = 2 | codeptr_ra = 0x401261
[implicit_task_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | actual_parallelism = 2 | index = 0 | flags = implicit_undeferred
[work_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | work_type = single_executor | count = 1 | codeptr_ra = 0x4012b0
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660004 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660005 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660006 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660007 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660008 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660009 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660010 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660011 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660012 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660013 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660014 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[task_create_cb] tid = 1 | encountering_task_data = 6660002 | new_task_data = 6660015 | flags = explicit | has_dependences = 0 | codeptr_ra = 0x401359
[work_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = end | work_type = single_executor | count = 1 | codeptr_ra = 0x4012e2
[sync_region_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[sync_region_wait_cb] tid = 1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[sync_region_wait_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660002 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[sync_region_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660002 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = 0x401359
[implicit_task_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660002 | endpoint = end | actual_parallelism = 0 | index = 0 | flags = implicit_undeferred
[parallel_end_cb] tid = 1 | parallel_data = 7770001 | encountering_task_data = 6660001 | flags = invoker_runtime_team | codeptr_ra = 0x401261
[implicit_task_cb] tid = 1 | parallel_data = 7777777 | task_data = 6660001 | endpoint = end | actual_parallelism = 0 | index = 1 | flags = initial
[thread_end_cb] tid = 1
[my_finalize_tool] tid = 1
==============
Thread ID = 2
[thread_begin_cb] tid = 2 | type = worker
[implicit_task_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | actual_parallelism = 2 | index = 1 | flags = implicit_undeferred
[work_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | work_type = single_other | count = 1 | codeptr_ra = 0x4012b0
[work_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = end | work_type = single_other | count = 1 | codeptr_ra = 0x4012b0
[sync_region_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[sync_region_wait_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660003 | endpoint = begin | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660004
[task_schedule_cb] tid = 2 | prior_task_data = 6660004 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660006
[task_schedule_cb] tid = 2 | prior_task_data = 6660006 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660005
[task_schedule_cb] tid = 2 | prior_task_data = 6660005 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660007
[task_schedule_cb] tid = 2 | prior_task_data = 6660007 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660009
[task_schedule_cb] tid = 2 | prior_task_data = 6660009 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660008
[task_schedule_cb] tid = 2 | prior_task_data = 6660008 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660010
[task_schedule_cb] tid = 2 | prior_task_data = 6660010 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660012
[task_schedule_cb] tid = 2 | prior_task_data = 6660012 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660011
[task_schedule_cb] tid = 2 | prior_task_data = 6660011 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660013
[task_schedule_cb] tid = 2 | prior_task_data = 6660013 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660015
[task_schedule_cb] tid = 2 | prior_task_data = 6660015 | prior_status = complete | next_task_data = 6660003
[task_schedule_cb] tid = 2 | prior_task_data = 6660003 | prior_status = switch | next_task_data = 6660014
[task_schedule_cb] tid = 2 | prior_task_data = 6660014 | prior_status = complete | next_task_data = 6660003
[sync_region_wait_cb] tid = 2 | parallel_data = 7777777 | task_data = 6660003 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[sync_region_cb] tid = 2 | parallel_data = 7777777 | task_data = 6660003 | endpoint = end | kind = barrier_implicit_parallel | codeptr_ra = (nil)
[implicit_task_cb] tid = 2 | parallel_data = 7777777 | task_data = 6660003 | endpoint = end | actual_parallelism = 0 | index = 1 | flags = implicit_undeferred
[thread_end_cb] tid = 2
Even though tid = 2
is inside a sync_region_wait
for an implicit barrier, we can see that tasks are executed.
How to reproduce the issue
I was able to reproduce the bug on a system running Ubuntu 22.04 LTS with all NVHPC versions since 22.11 (I don’t have older versions installed on my system)
The issue can be reproduced by downloading the attached archive, extracting it and running the following commands:
$ make clean && make CC=nvc && make run
reproducer.ZIP (6.3 KB)