OpenMP: unsupported opcode=OMPTARGETDATA


I have a small piece of code that calls an nvblas routine from a loop. I would like to offload the computation of the outer loop to the device: among other things, the loop body calls BLAS functions, but it also computes other things, so I cannot simply run the loops on the host and only call the nvblas routines from there.

So I have put together this tiny example:

#pragma omp target data map( tofrom: C[0:M*N*K*L] ) map( to: A[0:M*N*K*L], B[0:M*N*K*L] )
#pragma omp target teams distribute parallel for
    for( int i = 0 ; i < M ; i++ ){
        for( int j = 0 ; j < N ; j++ ){
            for( int k = 0 ; k < K ; k++ ){
                for( int l = 0 ; l < L ; l++ ){
                    C[i*N+j*K+k*L+l] = A[i*N+j*K+k*L+l] + B[i*N+j*K+k*L+l];
                }
            }
#pragma omp target data use_device_ptr( A, B, C )
            dgemm( "N", "N", &N, &K, &L, &alpha, &A[i*N+j*K], &K, &B[i*N+j*K], &L, &beta, &C[i*N+j*K], &L );
        }
    }

But when I try to compile it, I get an error that seems to come from the BLAS call:

$  nvc++ -I$OPENBLAS/include/openblas  -I$CUDAROOT/include -O3      \
         -o simplebench simplebench.cpp -lnvblas -L$CUDAROOT/lib64  \
         -cudalib=cublas -lcublas  -mp=gpu -Minfo=mp \
         -L$OPENBLAS/lib64 -lopenblas
     43, #omp target teams distribute parallel for
         43, Generating Tesla and Multicore code
             Generating "nvkernel_main_F1L43_1" GPU kernel
             Generating map(to:B[:L*(K*(M*N))]) 
             Generating map(tofrom:C[:L*(K*(M*N))]) 
             Generating map(to:A[:L*(K*(M*N))]) 
         47, Loop parallelized across teams and threads(128), schedule(static)
     55, Accelerator restriction: unsupported statement type: opcode=OMPTARGETDATA
[...]/Linux_x86_64/21.5/compilers/share/llvm/bin/opt: /tmp/nvc++HSFcJbDd-HwH.ll:1078:32: error: use of undefined value '%A.addr'
        %138 = load double*, double** %A.addr, align 8, !tbaa !26, !dbg !144

I am using nvc++ 21.5:

$ nvc++ --version

nvc++ 21.5-0 LLVM 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.

I have seen other error messages reported on this forum that seem close to mine, such as this one that also shows the use of undefined value '%.F0063.addr' error. But mine also has unsupported statement type: opcode=OMPTARGETDATA, which I haven't seen anywhere else.

If I comment out the call to dgemm, it compiles, but I suspect that in that case the compiler is simply removing the use_device_ptr pragma that precedes it.

Thanks a lot

Hi camomille,

We should be giving a better error message, but it's not legal to use a target data region within an offload region, so you'll want to remove the "target data use_device_ptr" pragma. It's also unnecessary: since the kernel is already running on the device, the device pointers are already being used.

A secondary issue is that cuBLAS (which nvBLAS is built upon) doesn’t support calls from within device code, only from the host. It used to, but dropped this support in CUDA 10.

Hence one alternative is to remove the "target teams distribute" pragma so that dgemm is called from the host.

The last option is to remove dgemm and instead write a basic matrix multiply that can be used in the device code.

Hope this helps,

Hi Matt,

Thanks a lot for this explanation! I will try the solutions you are suggesting…