OMPT support crashes with target offload program

The OMPT support in the 22.7 runtime crashes when I use it with a simple target offload program. Here is the backtrace:

Thread 1 "offload.nvidia" received signal SIGSEGV, Segmentation fault.
0x00007ffff54de344 in ompt_callback_target_submit_wrapper (endpoint=<optimized out>, targetData=<optimized out>, 
    hostOpId=0x7fffffffb8f8, requestedNumTeams=0) at ompt.c:151
151	ompt.c: No such file or directory.
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-189.1.el8.x86_64 libatomic-8.5.0-10.el8.x86_64 libgcc-8.5.0-10.el8.x86_64 libstdc++-8.5.0-10.el8.x86_64 zlib-1.2.11-18.el8_5.x86_64
(gdb) bt
#0  0x00007ffff54de344 in ompt_callback_target_submit_wrapper (endpoint=<optimized out>, targetData=<optimized out>, 
    hostOpId=0x7fffffffb8f8, requestedNumTeams=0) at ompt.c:151
#1  0x00007ffff54bb650 in launchHXTarget (filename=<optimized out>, funcname=<optimized out>, lineno=<optimized out>, 
    module=0x60f180 <__PGI_CUDA_LOC>, deviceId=0, hostFuncPtr=0x4029c0 <__nv_main_F1L52_1()>, deviceFuncPtr=0x0, 
    numArgs=<optimized out>, deviceArgBuffer=0x7fffffffbe70, deviceArgBufferSize=16, numTeams=0, threadLimit=0, numThreads=0, 
    preferredNumThreads=1000000, maxThreadsPerBlock=128, maxBlocks=0, mode=mode_target_teams_distribute_parallel_for, flags=7, 
    sharedMemBytes=0, async=-1) at nvomp_target.c:420
#2  0x00007ffff54b5999 in launchTarget (
    filename=0x403560 <.F00021846__Z4gtodv> "/storage/users/khuck/src/openmp_target/test.cpp", 
    funcname=0x403590 <.F00031849__Z4gtodv> "main", lineno=52, module=0x60f180 <__PGI_CUDA_LOC>, deviceId=0, 
    host_ptr=0x4029c0 <__nv_main_F1L52_1()>, args_num=2, args_base=0x7fffffffc630, args=0x7fffffffc620, 
    arg_sizes=0x7fffffffc6e0, arg_types=0x7fffffffc6f0, num_teams=0, thread_limit=0, num_threads=0, mode=<optimized out>, 
    flags=7, loop_trip_count=1000000, sharedMemBytes=0, globalMemBytes=0, async=-1, targetargs_ptr=0x7fffffffc5d0, 
    targetargs_size=6, ndeps=0, dep_list=0x0) at nvomp_target.c:1189
#3  0x00007ffff54b4c9a in __nvomp_target (filename=<optimized out>, funcname=<optimized out>, lineno=<optimized out>, 
    module=<optimized out>, device_id_64bit=<optimized out>, host_ptr=<optimized out>, args_num=<optimized out>, 
    args_base=<optimized out>, args=<optimized out>, arg_sizes=<optimized out>, arg_types=<optimized out>, 
    num_teams=<optimized out>, thread_limit=<optimized out>, num_threads=<optimized out>, mode=<optimized out>, 
    flags=<optimized out>, loop_trip_count=<optimized out>, sharedMemBytes=<optimized out>, globalMemBytes=<optimized out>, 
    nowait=<optimized out>, targetargs_ptr=<optimized out>, targetargs_size=<optimized out>) at nvomp_target.c:1294
#4  0x0000000000401cec in main () at test.cpp:54

The program is:

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
long gtod(void)
{
    struct timeval tv;
    gettimeofday(&tv, (struct timezone*)NULL);
    return 1.e6 * tv.tv_sec + tv.tv_usec;
}
typedef struct {
    double* a;
    double* b;
    int n;
    double alpha;
} mystruct;
void init_struct(mystruct *A, int n, double alpha)
{
    double* aalloc = (double *)malloc(n*sizeof(double));
    A->a = aalloc;
    double* balloc = (double *)malloc(n*sizeof(double));
    A->b = balloc;
    A->n = n;
    A->alpha = alpha;
}
void free_struct(mystruct *A)
{
    A->n = 0;
    free(A->a);
    free(A->b);
    A->a = NULL;
}
int main(int argc, char* argv[])
{
    int N = 1000000;
    mystruct* A;
    A=(mystruct*)malloc(sizeof(mystruct));
    init_struct(A, N, 0.5);
    double* aalloc = A->a;
    double* balloc = A->b;
#pragma omp target enter data map(alloc:aalloc[0:N])
#pragma omp target enter data map(alloc:balloc[0:N])
    printf("Initialize a\n");
    for(int i=0;i<N;i++)A->a[i]=0.1;
    printf("Initialize b\n");
    for(int i=0;i<N;i++)A->b[i]=0.;
    printf("Initialize a on device with update\n");
#pragma omp target update to(aalloc[0:N])
    {
    printf("Initialize b on device\n");
    double* balloc = A->b;
#pragma omp target
#pragma omp teams distribute parallel for
    for(int i=0;i<N;i++)balloc[i]=0.2;
    }
    {
    printf("Sum up a and b on device\n");
    double accu = 0.;
    double* aalloc = A->a;
    double* balloc = A->b;
    double alpha=A->alpha;
#pragma omp target map(tofrom: accu) map(to: alpha)
#pragma omp teams distribute parallel for reduction(+:accu)
    for(int i=0;i<N;i++)accu+=alpha*(aalloc[i]+balloc[i]);
    printf("accu = %le\n", accu);
    }
    {
    printf("Sum up a and b on device again\n");
    long t1     = gtod();
    double accu = 0.;
    double* aalloc = A->a;
    double* balloc = A->b;
    double alpha=A->alpha;
#pragma omp target map(tofrom: accu) map(to: alpha)
#pragma omp teams distribute parallel for reduction(+:accu)
    for(int i=0;i<N;i++)accu+=alpha*(aalloc[i]+balloc[i]);
    long t2       = gtod();
    long int usec = t2 - t1;
    printf("accu = %le\n", accu);
    printf("time/us: %ld\n", usec);
    }
    // verify b is still 0 on host
    for(int i=0;i<5;i++)
        if(balloc[i]>0.)printf("balloc[%d]=%le\n",i,balloc[i]);
#pragma omp target exit data map(delete:aalloc[0:N])
#pragma omp target exit data map(delete:balloc[0:N])
    free_struct(A);
    free(A);
}

and I compiled it with: -Minstrument=functions -fast -mp=gpu,ompt -gpu=cc80 -Minfo=mp,accel test.cpp -o offload.nvidia on a RHEL 8 system with an A100 GPU.
Thanks!

Hi khuck,

For good or bad, I’m not able to reproduce the error. The code works for me on multiple systems. Though since you provided the traceback, I sent a note to one of our OpenMP compiler engineers to see if he has any idea what could be wrong and what could account for the difference between our runs.

-Mat

% nvc++ -Minstrument=functions -fast -mp=gpu,ompt -gpu=cc80 test.cpp -V22.7 ; a.out
Initialize a
Initialize b
Initialize a on device with update
Initialize b on device
Sum up a and b on device
accu = 1.500000e+05
Sum up a and b on device again
accu = 1.500000e+05
time/us: 91

Mat -
thanks for looking into it… the program runs fine on its own, but it crashes when run with a tool attached - I should have mentioned that in the first comment. I don’t think it’s tool-specific, so any OMPT tool will do.

I ran it using Nsight Systems with OMPT support enabled, and it worked fine, so the tool being used may be relevant. Which tool are you using?

I just looked at the email for your account, so I have a good idea which tool you’re using. Let me see about getting it installed someplace (if we don’t have it someplace already) and give it a try. Might not be until tomorrow, though.

-Mat

Interesting… I’ll do some more digging to make sure it’s not something on our end.

Yes, we’re using TAU (https://github.com/UO-OACISS/tau2) and APEX (https://github.com/UO-OACISS/apex) to test.

Here’s an easy test - grab the callbacks.h file from the aomp-dev branch of the ROCm-Developer-Tools/aomp repository on GitHub and include it in the example program that I sent you. That will include a performance tool in the program, and it reproduces the crash. Thanks!

Excellent, thanks!

I was able to reproduce the error and have filed TPR #32358.

Interesting that the callbacks are failing to register. Possibly a related issue.

Yes, the “failed” registration is interesting, because I do see a couple of callbacks before the crash. So I don’t know whether the registration is returning the wrong error code, or whether it indicates a related issue…

I found a similar bug report for LLVM: “Segfault in clangs ompt_callback system”, llvm/llvm-project issue #55073 on GitHub.

In this case, when run using nvc++, the callback gets registered (ompt_set_callback returns ‘ompt_set_always’) but we get the same runtime segv. So I’m not sure whether the failed registration in your code is related to the segv or is its own separate issue. Though that will be for the compiler engineers to figure out.

Again thanks for the report, it’s appreciated.

Go Ducks!
Mat