ompt_callback_work with ompt_scope_end not always dispatched

Using nvc 23.1, host-only:

Although the callback is registered with result ompt_set_always, the following code doesn’t dispatch ompt_callback_work with ompt_scope_end at the end of the worksharing loop:

#pragma omp parallel
{
    #pragma omp for
    for (int i=1; i<5; i++)
        foo(i);
}

Expected events are:

encountering: parallel_begin
worker: implicit_task( begin )
worker: work( begin )
worker: work( end ) <- expected but missing
# sync for
worker: sync_region( begin )
worker: sync_region_wait( begin )
worker: sync_region_wait( end )
worker: sync_region( end )
# sync parallel
worker: sync_region( begin )
worker: sync_region_wait( begin )
worker: sync_region_wait( end )
worker: sync_region( end )
worker: implicit_task( end )
encountering: parallel_end

Note that the combined construct

#pragma omp parallel for
for (int i=1; i<5; i++)
    foo(i);

dispatches all events as expected, i.e.

encountering: parallel_begin
worker: implicit_task( begin )
worker: work( begin )
worker: work( end )
worker: sync_region( begin )
worker: sync_region_wait( begin )
worker: sync_region_wait( end )
worker: sync_region( end )
worker: implicit_task( end )
encountering: parallel_end

It would be nice if this could be fixed in the next release.

Issue also mentioned here: NVHPC 22.11/23.1 -- OMPT methods can cause SegFault when offloading - #6 by jan.andre.reuter

Thanks,
Christian

Thanks Christian. Do you have the complete example with the callbacks included? We can try to pull something together, but if you have it already, that would be great.

-Mat

ompt-printf-0.tar.gz (122.0 KB)
reproducer2.c (245 Bytes)
reproducer1.c (289 Bytes)

Hi Mat,

Please use reproducer1.c and reproducer2.c with the attached ompt-printf-0.tar.gz like this:

tar xf ompt-printf-0.tar.gz
cd ompt-printf-0
./configure CC=nvc --prefix=`pwd`/_install
make install
# Follow the compile and link commands printed by make install to build reproducer1 and reproducer2. Don't forget to add -mp=ompt to the link line.

The output is more verbose than what I posted originally (use a low thread count), but it should be straightforward to decode.
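
To spot the missing event without reading the whole trace, one could simply count the work begin and end lines. Below is a rough sketch. It assumes the simplified event format from my first post (e.g. "worker: work( begin )"), not the actual, more verbose ompt-printf output, so the search strings would need adjusting. count_events and unmatched_work_events are hypothetical helper names.

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of `needle` in `trace`. */
static int count_events(const char *trace, const char *needle)
{
    int count = 0;
    size_t len = strlen(needle);
    if (len == 0)
        return 0;
    for (const char *p = strstr(trace, needle); p != NULL;
         p = strstr(p + len, needle))
        count++;
    return count;
}

/* Number of work( begin ) events without a matching work( end ).
 * Assumed line format: "worker: work( begin )" / "worker: work( end )". */
static int unmatched_work_events(const char *trace)
{
    return count_events(trace, "work( begin )")
         - count_events(trace, "work( end )");
}
```

With the trace read into a string, a non-zero result from unmatched_work_events would point to the missing ompt_scope_end dispatch.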

Thanks
Christian