OpenMP: unsupported opcode=OMPTARGETDATA

camomille · May 4, 2022, 1:41am

Hello,

I have a small piece of code that calls an nvblas routine from inside a loop. I would like to offload the outer loop to the device because, in addition to calling BLAS functions, it does other computation. For the same reason, I cannot simply run the loops on the host and call the nvblas routines from there.

So I have put together this tiny example:

#pragma omp target data map( tofrom: C[0:M*N*K*L]) map( to: A[0:M*N*K*L], B[0:M*N*K*L] )
#pragma omp target teams distribute parallel for
    for( int i = 0 ; i < M ; i++ ){
        for( int j = 0 ; j < N ; j++ ){
            for( int k = 0 ; k < K ; k++ ){
                for( int l = 0 ; l < L ; l++ ){
                    C[i*N+j*K+k*L+l] = A[i*N+j*K+k*L+l] + B[i*N+j*K+k*L+l]; 
                }
            }
#pragma omp target data use_device_ptr( A, B, C )
            {
                dgemm( "N", "N", &N, &K, &L, &alpha, &A[i*N+j*K], &K, &B[i*N+j*K], &L, &beta, &C[i*N+j*K], &L );
            }
        }
    }

But when I try to compile it, I get an error that seems to happen on the BLAS call:

$  nvc++ -I$OPENBLAS/include/openblas  -I$CUDAROOT/include -O3      \
         -o simplebench simplebench.cpp -lnvblas -L$CUDAROOT/lib64  \
         -cudalib=cublas -lcublas  -mp=gpu -Minfo=mp \
         -L$OPENBLAS/lib64 -lopenblas
main:
     43, #omp target teams distribute parallel for
         43, Generating Tesla and Multicore code
             Generating "nvkernel_main_F1L43_1" GPU kernel
             Generating map(to:B[:L*(K*(M*N))]) 
             Generating map(tofrom:C[:L*(K*(M*N))]) 
             Generating map(to:A[:L*(K*(M*N))]) 
         47, Loop parallelized across teams and threads(128), schedule(static)
     55, Accelerator restriction: unsupported statement type: opcode=OMPTARGETDATA
[...]/Linux_x86_64/21.5/compilers/share/llvm/bin/opt: /tmp/nvc++HSFcJbDd-HwH.ll:1078:32: error: use of undefined value '%A.addr'
        %138 = load double*, double** %A.addr, align 8, !tbaa !26, !dbg !144
                                      ^

I am using nvc++ 21.5:

$ nvc++ --version

nvc++ 21.5-0 LLVM 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.

I have seen other error messages reported on this forum that seem close to mine, such as this one that also shows the use of undefined value '%.F0063.addr' error. But mine has unsupported statement type: opcode=OMPTARGETDATA, which I haven’t seen anywhere else.

If I comment out the call to dgemm, it compiles, but I suspect that in this case the compiler might be removing the use_device_ptr that comes before.

Thanks a lot

MatColgrove · May 4, 2022, 4:56pm

Hi camomille,

We should be giving a better error message, but it’s not legal to use target data regions within an offload region. You want to remove the “target data use_device_ptr” pragma. It’s also unnecessary: the kernel is already running on the device, so it is already using the device pointers.

A secondary issue is that cuBLAS (which nvBLAS is built upon) doesn’t support calls from within device code, only from the host. It used to, but dropped this support in CUDA 10.

Hence the alternative solution here is to remove the “target teams distribute” pragma so that dgemm is called from the host.
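A minimal sketch of that host-side structure, under stated assumptions: the element-wise work gets its own offloaded loop, and the BLAS call moves to an ordinary host loop. The names `host_gemm` and `add_then_multiply` are hypothetical, `host_gemm` is only a stand-in for the real dgemm so the example is self-contained, and the square `n` x `n` batch layout is an illustration rather than the layout of the original post.

```cpp
#include <cassert>

// Hypothetical stand-in for a host BLAS dgemm (column-major, no transpose):
// C = alpha*A*B + beta*C with A (m x k), B (k x n), C (m x n).
void host_gemm(int m, int n, int k, double alpha,
               const double* A, int lda, const double* B, int ldb,
               double beta, double* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
    }
}

// Computes S = A + B on the device, then C_b = S_b * B_b per batch on the host.
void add_then_multiply(int n, int batches, const double* A, const double* B,
                       double* S, double* C) {
    assert(n > 0 && batches > 0);
    const int total = batches * n * n;

    // Offloaded region: element-wise work only, no BLAS call inside it.
    #pragma omp target teams distribute parallel for \
        map(to: A[0:total], B[0:total]) map(from: S[0:total])
    for (int idx = 0; idx < total; ++idx) {
        S[idx] = A[idx] + B[idx];
    }

    // Host loop: the BLAS routine is called from the host, outside any target region.
    for (int b = 0; b < batches; ++b) {
        const int off = b * n * n;
        host_gemm(n, n, n, 1.0, &S[off], n, &B[off], n, 0.0, &C[off], n);
    }
}
```

The trade-off in this arrangement is that the intermediate result `S` is copied back to the host before the BLAS call, which may or may not be acceptable depending on the problem size.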

The last option is to remove dgemm and instead write a basic matrix-multiply that then can be used in the device code.
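As a rough illustration of this last option, a basic matrix multiply can be marked with `declare target` and called from inside the offloaded loop. This is a sketch, not a tuned kernel: `device_gemm` and `batched_gemm` are hypothetical names, the square `n` x `n` batch layout is an assumption, and a naive triple loop like this one is unlikely to match a library dgemm.

```cpp
#include <cassert>

#pragma omp declare target
// Naive column-major C = alpha*A*B + beta*C, compiled for the device as well as the host.
// A is m x k, B is k x n, C is m x n.
void device_gemm(int m, int n, int k, double alpha,
                 const double* A, int lda, const double* B, int ldb,
                 double beta, double* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
    }
}
#pragma omp end declare target

// One small matrix product per loop iteration, all inside a single target region.
// No nested "target data" construct is needed: the mapped pointers are already
// device pointers inside the region.
void batched_gemm(int batches, int n, const double* A, const double* B, double* C) {
    assert(n > 0 && batches > 0);
    const int total = batches * n * n;

    #pragma omp target teams distribute parallel for \
        map(to: A[0:total], B[0:total]) map(tofrom: C[0:total])
    for (int b = 0; b < batches; ++b) {
        const int off = b * n * n;
        device_gemm(n, n, n, 1.0, A + off, n, B + off, n, 0.0, C + off, n);
    }
}
```

Each iteration here runs its whole product in a single thread, so this layout only makes sense when there are many small, independent products.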

Hope this helps,
Mat

camomille · May 6, 2022, 1:25pm

Hi Mat,

Thanks a lot for this explanation! I will try the solutions you are suggesting…

Camille

seinsinnes · November 18, 2024, 12:50pm

Hi Mat,

Is it still the case that cuBLAS doesn’t support calls from within device code?

I’m trying something similar in Fortran. I have an OpenMP offloaded loop that does matrix multiplications. Currently, I have some reference dgemm code that gets compiled into device code. It isn’t performing that well because of the current memory access pattern.

Are there any good options other than writing my own bespoke dgemm for my use case, since cuBLAS doesn’t seem to be an option?

MatColgrove · November 18, 2024, 4:31pm

Yes.

Are there any good options other than writing my own bespoke dgemm for my use case, since cuBLAS doesn’t seem to be an option?

Not that I’m aware of.

Sorry,
Mat

seinsinnes · February 7, 2025, 12:21pm

It doesn’t work for the OP’s C++ code, but for me it was possible to use the Fortran intrinsic matmul inside offloaded loops as an alternative to a hand-written device-side dgemm.
