Various compiler problems with offloading to a Blackwell 5060 Ti

Hi, when I try to compile my library (the source of which is at)

with nvc++, it does not compile: nvc++ raises an error when it comes across a #pragma omp unroll. It could simply ignore the directive, but instead it stops compilation.

Also, I think that for small loops unroll, like the tile directive, is rather beneficial, so nvc++ should support it.

My library uses OpenMP target offloading together with the OpenMPI message passing library to work on clusters. If I compile it with clang, it works except for the following:

Any application compiled with clang that uses GPU offloading together with OpenMPI and is run with mpirun -np 8 ./a.out will yield

The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u

You can test it with something as simple as the following main.cpp:

#include <omp.h>
#include <stdio.h>
#include <mpi.h>
int main(int argc, char** argv)
{
int process_Rank, size_Of_Cluster;

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);

printf("Hello World from process %d of %d\\n", process_Rank, size_Of_Cluster);

MPI_Finalize();
return 0;

}

First, write

export OMPI_CXX=clang++

then mpicxx --show

will yield
clang++ -Wl,-rpath -Wl,/usr/lib64 -Wl,--enable-new-dtags -lmpi

Now compile main.cpp with
mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

which will then just call clang with the above parameters…

Running it with

mpirun -np 8 ./a.out

will then yield the above error.

After that, the program will (usually) proceed as normal. Nvidia is sometimes involved in the development of LLVM and Clang. Can you do something to fix this?

OpenMPI and CUDA are an important combination.

Also, Clang currently does not map OpenMP simd onto the OpenACC vector level on the GPU, as gcc now does.

Unfortunately, when I compile my library with gcc (the latest development branch), I get strange memory errors from libgomp when executing the programs sparsetests and mathdemonstrations on my Blackwell 5060 Ti card.

I posted this at gcc's Bugzilla. The gcc developers who maintain the OpenMP code for accelerators are Thomas Schwinge and Tobias Burnus. They say, however, that my code works on their graphics cards, and they cannot test it for my chip because they have no Blackwell card at hand. Gcc currently only compiles for sm_89, but that should not be a problem since CUDA should be backward compatible. I don't know whether that assumption actually holds, though.

Interestingly, the code for sparsetests and mathdemonstrations works when compiled for offload with clang, and I can see the cuda kernels running without errors.

What is more, with gcc, if I use collapse(2) on the first two loops of a matrix multiplication, my RTX 5060 Ti produces nonsensical values; Clang compiles this fine, however.

The following simple test program executes a few matrix multiplications: single threaded on the host, multi-threaded with collapse(2) for the first two loops on the host, on the GPU with target teams distribute for the first loop followed by parallel for for the second loop, and finally on the GPU with target teams distribute parallel for collapse(2) for the first two loops. With clang, the results of the multiplications always agree. The gcc developers say that for their cards this is also true…

For my Blackwell, with gcc, I always get different results for the collapse(2) version of the matrix multiplication.

The question is whether this is a problem of gcc, of libcuda, or whether my hardware is defective. I use the latest Linux kernel, 6.17.8.

Also interesting is what happens if I run valgrind on the GPU code. Of course valgrind cannot access GPU pointers and so is expected to report lots of inaccessible memory. It does, which is fine, but it also reports things like this:

==17772== 16 bytes in 1 blocks are definitely lost in loss record 10 of 77
==17772== at 0x490F8D8: malloc (vg_replace_malloc.c:447)
==17772== by 0x11C36E47: ??? (in /usr/lib64/libcuda.so.580.105.08)
==17772== by 0x11D26356: ??? (in /usr/lib64/libcuda.so.580.105.08)
==17772== by 0x11C22C22: ??? (in /usr/lib64/libcuda.so.580.105.08)
==17772== by 0x52875BE: start_thread (in /usr/lib64/libc.so.6)
==17772== by 0x531A283: clone (in /usr/lib64/libc.so.6)

The question is whether that is a memory problem in libcuda, since the code below, which triggered this, does not call malloc on its own.

Below are the matrix multiplications. Compiled with clang and the options

-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall

there is no output, as the results always agree. Compiled with gcc-15 or gcc-16 (git version), and the options

-fopenmp -foffload=nvptx-none -foffload-options=nvptx-none=-march=sm_89 -fno-stack-protector

I always get the report that the matrix multiplication with collapse(2) differs from the single-threaded one… Perhaps you can analyze the CUDA code and see what goes wrong, whether in the compiler, libgomp, libcuda, the GPU driver, or the Blackwell card itself. Currently, I use driver 580.95.05 and nvidia-cuda-toolkit-12.9.1-r1 on Gentoo, since CUDA 13 does not work with much software yet and the newest driver, .108, was downgraded due to regressions. I did, however, also try the newest driver and cuda-toolkit 13. The memory problems when compiling my library with gcc, the MPI problems when compiling with clang, and the matrix multiplication problems with collapse(2) on gcc were also present with cuda-toolkit 13 and the .108 version of the nvidia drivers, on my nvidia GeForce RTX 5060 Ti (CUDA arch sm_120) card under kernel 6.17.8. My CPU is an AMD Ryzen 9 3900X 12-core processor (a bit old, I know) with 32 GB RAM.

#include <omp.h>
#include <stdio.h>
#include<iostream>
#include<vector>
using namespace std;
template<typename T>
class DataBlock
{
public:
    size_t* dpextents;
    size_t* dpstrides;
    T* dpdata;
    size_t dpdatalength;
    DataBlock(size_t *ext,size_t *str, T *dat, size_t datlength):dpextents(ext),dpstrides(str),dpdata(dat),dpdatalength(datlength) {}
};


template<typename T>
void printmatrix(DataBlock<T> &C)
{
    for (size_t i=0; i<C.dpextents[0]; i++)
    {
        for(size_t j=0; j<C.dpextents[1]; j++)
            cout<< C.dpdata[i*C.dpstrides[0]+j*C.dpstrides[1]]<< " ";
        cout<< "\n";
    }
    cout<<"\n\n";
}





template <typename T>
void matrix_multiply_dot_g(  const DataBlock<T>& A, const  DataBlock<T>& B, DataBlock<T>& C,int dev)
{
    const size_t Astr0=A.dpstrides[0];
    const size_t Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0];
    const size_t Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0];
    const size_t Cstr1=C.dpstrides[1];

    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

    #pragma omp target enter data map(to:A,A.dpdata[0:A.dpdatalength])device(dev)
    #pragma omp target enter data map(to:B,B.dpdata[0:B.dpdatalength])device(dev)
    #pragma omp target enter data map(to:C,C.dpdata[0:C.dpdatalength])device(dev)



    #pragma omp target teams distribute  device(dev)
    for (size_t i = 0; i < rows; ++i)
        #pragma omp parallel for
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = T(0);
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1]= sum;
        }

    #pragma omp target update from (C.dpdata[0:C.dpdatalength])device(dev)
    #pragma omp target exit data map(release:C.dpdata[0:C.dpdatalength],C)device(dev)
    #pragma omp target exit data map(release:A.dpdata[0:A.dpdatalength],A)device(dev)
    #pragma omp target exit data map(release:B.dpdata[0:B.dpdatalength],B)device(dev)
}



template <typename T>
void matrix_multiply_dot_g_with_collapse(  const DataBlock<T>& A, const  DataBlock<T>& B, DataBlock<T>& C,int dev)
{
    const size_t Astr0=A.dpstrides[0];
    const size_t Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0];
    const size_t Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0];
    const size_t Cstr1=C.dpstrides[1];

    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

    #pragma omp target enter data map(to:A,A.dpdata[0:A.dpdatalength])device(dev)
    #pragma omp target enter data map(to:B,B.dpdata[0:B.dpdatalength])device(dev)
    #pragma omp target enter data map(to:C,C.dpdata[0:C.dpdatalength])device(dev)



    #pragma omp target teams distribute parallel for collapse(2) device(dev)
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
        {
            T sum = T(0);
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1]= sum;
        }

    #pragma omp target update from (C.dpdata[0:C.dpdatalength])device(dev)
    #pragma omp target exit data map(release:C.dpdata[0:C.dpdatalength],C)device(dev)
    #pragma omp target exit data map(release:A.dpdata[0:A.dpdatalength],A)device(dev)
    #pragma omp target exit data map(release:B.dpdata[0:B.dpdatalength],B)device(dev)
}




template <typename T>
void matrix_multiply_dot_w( const DataBlock<T>& A, const  DataBlock<T>& B, DataBlock<T>& C)
{
    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

    const size_t Astr0=A.dpstrides[0];
    const size_t Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0];
    const size_t Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0];
    const size_t Cstr1=C.dpstrides[1];

    #pragma omp parallel for collapse(2)
    for (size_t i = 0; i < rows; ++i)
    {
        for (size_t j = 0; j < cols; ++j)
        {
            T sum =T(0);
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1]= sum;
        }
    }
}


template <typename T>
void matrix_multiply_dot_s( const DataBlock<T>& A, const  DataBlock<T>& B, DataBlock<T>& C)
{
    const size_t rows=A.dpextents[0];
    const size_t cols=B.dpextents[1];
    const size_t inner_dim=A.dpextents[1];

    const size_t Astr0=A.dpstrides[0];
    const size_t Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0];
    const size_t Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0];
    const size_t Cstr1=C.dpstrides[1];

    for (size_t i = 0; i < rows; ++i)
    {
        for (size_t j = 0; j < cols; ++j)
        {
            T sum =T(0);
            for (size_t k = 0; k < inner_dim; ++k)
            {
                sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
            }
            C.dpdata[i*Cstr0+j*Cstr1]= sum;
        }
    }
}



int main(int argc, char** argv)
{


    size_t rows = 12, cols = 12;

    vector<double> A_data=
    {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
        12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1,
        2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11,
        11, 9, 7, 5, 3, 1, 12, 10, 8, 6, 4, 2,
        3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10,
        10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3,
        4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9,
        9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10,
        5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12,
        12, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5,
        6, 1, 8, 3, 10, 5, 12, 7, 2, 9, 4, 11,
        11, 2, 9, 4, 12, 7, 3, 10, 5, 1, 8, 6
    };
    vector<double> B_data=
    {
        12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1,
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
        3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10,
        10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3,
        5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12,
        12, 9, 6, 3, 10, 7, 4, 1, 8, 5, 2, 11,
        2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11,
        11, 8, 5, 2, 9, 6, 3, 12, 7, 4, 1, 10,
        3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10,
        10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3,
        4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9,
        9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10
    };
size_t i=0;
    do
    {
    vector<double>C0_data(rows*cols,0);

    vector<size_t> rowmajorstrides= {cols,1};
    vector<size_t> extents= {rows,cols};

    DataBlock<double> A= {extents.data(),rowmajorstrides.data(),A_data.data(),12*12};
    DataBlock<double> B= {extents.data(),rowmajorstrides.data(),B_data.data(),12*12};
    DataBlock<double> C0= {extents.data(),rowmajorstrides.data(),C0_data.data(),12*12};

    matrix_multiply_dot_s(A,B,C0);


    vector<double>C1_data(rows*cols,0);
    DataBlock<double> C1= {extents.data(),rowmajorstrides.data(),C1_data.data(),12*12};

    matrix_multiply_dot_w(A,B,C1);


    vector<double>C2_data(rows*cols,0);
    DataBlock<double> C2= {extents.data(),rowmajorstrides.data(),C2_data.data(),12*12};

    matrix_multiply_dot_g(A,B,C2,omp_get_default_device());


    vector<double>C3_data(rows*cols,0);
    DataBlock<double> C3= {extents.data(),rowmajorstrides.data(),C3_data.data(),12*12};

    matrix_multiply_dot_g_with_collapse(A,B,C3,omp_get_default_device());
    i++;
    if(C0_data!=C1_data)
    {

        cout <<"A\n";
        printmatrix(A);
        cout<<"B\n";
        printmatrix(B);
        cout<<"multiplication of A and B:  result0 (single threaded)!= result2 (parallel for collapse(2) at attempt="<<i <<"\n";

        printmatrix(C0);
        printmatrix(C1);
        break;
    }
    if(C0_data!=C2_data)
    {
        cout <<"A\n";
        printmatrix(A);
        cout<<"B\n";
        printmatrix(B);
        cout<<"multiplication of A and B:  result1 (single threaded)!= result2(target teams distribute and parallel for in separate loop) at attempt="<<i<<"\n";

        printmatrix(C0);
        printmatrix(C2);
        break;
    }
    if(C0_data!=C3_data)
    {
        cout <<"A\n";
        printmatrix(A);
        cout<<"B\n";
        printmatrix(B);
        cout<<"multiplication of A and B: result1 (single threaded) != result3 (target teams distribute parallel for collapse(2)) at attempt="<<i<<"\n";
        printmatrix(C0);
        printmatrix(C3);
        break;
    }
    }
    while (i<200);

    return 0;

}

I hope this gets resolved soon. I want to substantially upgrade the message passing part of the library so that it can be used well on GPU clusters…

That's fair, so I put in a request to engineering, TPR #38003.

Though some of the other directives you use that we don't support, like "requires unified_address", won't be as easy since they can't be ignored.

Note we have documented the subset of OpenMP 5.0 that we support in the NVIDIA HPC Compilers User's Guide (25.9 documentation).

If you're able to port your code to only use this subset, I'd be happy to help with any issues you might encounter.

Nvidia is sometimes involved with the development of llvm and Clang. Can you do something to fix that?

The NVHPC team does work with the LLVM community, but this is primarily with the development of flang. We’re not involved with clang’s OpenMP offload efforts so you’re best off working directly with them.

The question is whether this is a problem of gcc, of libcuda, or whether my hardware is defective.

Unfortunately, I can’t help much there either. I did test the “main.cpp” code you posted on Bugzilla and it seemed to work for me when compiled with nvc++ on a Blackwell.

-Mat

Hi Mat, thank you for your assistance and your attempt to improve nvc++.

On omp requires unified_address:

The OpenMP standard dictates that you are not allowed to do pointer arithmetic on device pointers unless omp requires unified_address is specified.

When you work with matrices and tensors on the device, you often have operations (like extracting a view of a column of a matrix) where you only want to change the pointers to the extents of the matrix (which, since that array is small, you can keep on the host until you do a computation) and adjust the pointer to the data stored on the device, so that it points to the start of the extracted column.

Such an operation does not need to launch a time-consuming CUDA kernel, but only if you can do pointer arithmetic on the device pointer to the data.

And, according to the OpenMP 5.0 standard, one is only allowed to do this with device pointers when omp requires unified_address is specified.

https://www.openmp.org/spec-html/5.0/openmpse12.html

So for libraries working with arrays, matrices and tensors in C++ under OpenMP with offload, omp requires unified_address is useful. Otherwise, one has to launch a CUDA kernel just to extract the view of a submatrix, column or row…
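To make this concrete, here is a minimal sketch of what I mean (column_start and demo are just names for illustration, not part of my library), assuming a compiler that accepts omp requires unified_address:

#include <omp.h>
#include <cstddef>

#pragma omp requires unified_address

// Under unified_address, a pointer returned by omp_target_alloc supports
// pointer arithmetic on the host, so extracting a "view" of column j of a
// row-major rows x cols matrix is just an addition, with no kernel launch.
double *column_start(double *dev_data, std::size_t j)
{
    return dev_data + j;   // elements of column j are then spaced by cols
}

void demo(std::size_t rows, std::size_t cols, int dev)
{
    double *d = static_cast<double *>(
        omp_target_alloc(rows * cols * sizeof(double), dev));

    double *col3 = column_start(d, 3);   // host-side arithmetic only

    #pragma omp target is_device_ptr(col3) device(dev)
    {
        col3[0] = 1.0;   // first element of column 3
    }

    omp_target_free(d, dev);
}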

By the way, some time ago I came across something which I thought was a gcc bug, but the gcc team claimed it is a problem with nvc++ and OpenACC:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121573

At least that’s what they told me in this report:

That sounds like a bug in nvc++ then. The spec is what GCC is following here as far as I can tell.

This one, however, seems to be an incorrect implementation of OpenACC in gcc:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121178

I am posting this because, well, I want my code to run on as many compilers as possible (especially when things are supposed to be standardized)…

As for offloading and OpenMPI with clang:

I currently cannot compile code that uses the OpenMPI library with flang; it shows compilation errors, and I can compile it only with gfortran. But clang/llvm generally has only one offload system… I cannot verify it for the reasons stated above, but I would assume that if there is a problem with clang++ using offloading and CUDA-aware OpenMPI, it may manifest itself for flang as well.

In that problematic code, it may be that clang tries to access the same host memory for each process when it is started with 8 processes on a single device, but I don't really know…

An OpenMPI dev had these comments: https://github.com/open-mpi/ompi/issues/13431#issuecomment-3555594428

As for offloading and gcc, it seems that there was an ABI change with CUDA 13. Now it looks like I can't offload to any target with gcc anymore after that nvidia driver update: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122783

Clang can still offload ordinary kernels in applications that do not use OpenMPI because of this fix for CUDA 13:

https://github.com/llvm/llvm-project/commit/dffd7f3d9a3294d21205251b986e76ec841cc750

which has not yet been done in gcc.

To ease this, it would be good if nvidia told developers when it changes the CUDA ABI…

No error appears in some of the offloading programs I compiled with gcc; only compute-sanitizer shows them. If my application had not been a simple test but had done something serious, it would have run and computed nonsensical values after a user upgraded his GPU and CUDA.

So, next time, it would be good if you published changes to your CUDA ABI, especially when they are introduced by an update of the display driver. Not everybody just writes entertaining chatbots with CUDA…

I hope that clang, nvc++ or gcc is fixed soon so that I can further develop the message passing support for my library. I want the entire set of BLAS functions to use hybrid parallelism: OpenMPI to distribute across the nodes and OpenMP on the GPU. I also want to add more tensor support for relativity, and support for combined expressions, so that one can compute in a pythonesque way (currently there is only lazy evaluation with single operators). Since my library will be smaller, I hope it can be maintained more easily. Once these basics are done, I want to add code for statistics, differential equations, autodiff and so on… Solutions like JAX/TensorFlow/PyTorch have a bit too many bugs for me and sometimes not enough mathematics…

Currently, the algorithms are not too fast (although they use all three levels of GPU parallelization wherever possible), and some can already distribute their workload across CUDA-aware OpenMPI cluster nodes. Some functions would probably get faster by replacing a few C++ kernels with Fortran code, but that would give at most 20% more speed, if anything. So yes, if more compilers could compile my code, one could start benchmarking and see which compiler optimizes best…

For unified memory, you’d add the flag “-gpu=mem:unified”. Though full UM (as opposed to managed memory) requires HMM to be installed on the system (default on Grace-Hopper systems).

By the way, some time ago I came across something which I thought was a gcc bug, but the gcc team claimed it is a problem with nvc++ and OpenACC

Well, this may be a difference in interpretation.

“my_struct” is a global that’s statically initialized. Static initialization is done when the binary is loaded, so there must be a host copy. Device initialization is done when the first OpenACC construct is encountered. At that time a device copy of “my_struct” is created. However, “device_resident” only allocates variables in device memory. There’s no initialization. Hence “device_resident” is effectively the same as “declare create” in this case.

GNU may also be initializing the variable, but that’s not specified by the standard.

I’d suggest you use “declare copyin”, which is more appropriate for this situation and should work for both compilers.

Hi Mat, I know about the flags of nvc++…

Yet, for my library, I only need device pointers that support pointer arithmetic, i.e. where I can have a pointer p on the device and then compute p_new = p + offset on the host, such that p_new is then a device pointer pointing to the same location as if I had written

int *p_new;
int offset = 5;
int *p_old;   // a device pointer obtained elsewhere

#pragma omp target map(to: offset) map(from: p_new) is_device_ptr(p_old)
{ p_new = p_old + offset; }

I suspect that writing

p_new = p_old + offset

on the host instead is always possible on nvidia devices? I don't really know, however.

Are there nvidia devices on the market where device pointers do not support pointer arithmetic?

In any case, I want my library to be standards compliant with OpenMP. And the OpenMP 5 standard says that pointer arithmetic with device pointers is only allowed if

#pragma omp requires unified_address

is used….

So I adapted my code accordingly. I don't really need unified memory in the sense of nvidia's terminology. I think in nvidia's terminology, unified memory would correspond to

#pragma omp requires unified_shared_memory

The difference between these in the OpenMP standard is this:

When the unified_address clause appears on a requires directive, the implementation guarantees that all devices accessible through OpenMP API routines and directives use a unified address space. In this address space, a pointer will always refer to the same location in memory from all devices accessible through OpenMP. The pointers returned by omp_target_alloc and accessed through use_device_ptr are guaranteed to be pointer values that can support pointer arithmetic while still being native device pointers. The is_device_ptr clause is not necessary for device pointers to be translated in target regions, and pointers found not present are not set to null but keep their original value. Memory local to a specific execution context may be exempt from this requirement, following the restrictions of locality to a given execution context, thread, or contention group. Target devices may still have discrete memories and dereferencing a device pointer on the host device or host pointer on a target device remains unspecified behavior.

The unified_shared_memory clause implies the unified_address requirement, inheriting all of its behaviors. Additionally, memory in the device data environment of any device visible to OpenMP, including but not limited to the host, is considered part of the device data environment of all devices accessible through OpenMP except as noted below. Every device address allocated through OpenMP device memory routines is a valid host pointer. Memory local to an execution context as defined in unified_address above may remain part of distinct device data environments as long as the execution context is local to the device containing that environment.

Basically, I think

omp requires unified_shared_memory

should be easy to implement. GCC interprets this in CUDA terms as pageableMemoryAccess (https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html). That should be simple for a compiler to query.

For

omp requires unified_address,

that's basically any device where you can do pointer arithmetic and where host and device addresses don't overlap, i.e. if an address 5 exists on the host, then address 5 does not exist on the device, and vice versa… Variables still must be mapped, and device pointers cannot be dereferenced on the host. I myself only need the pointer arithmetic from omp requires unified_address and map the variables to the device (although my library also supports #pragma omp requires unified_shared_memory behind a compiler flag and then does not do any mapping…)

I suspect most nvidia devices support unified addressing. At least I have never heard that pointer arithmetic would be forbidden on a pointer returned by cudaMalloc…

When I look at the HPC SDK, there is a tool called nvaccelinfo; this is its output for my RTX 5060 Ti:

Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Cluster Launch: Yes
Unified Function Pointers: Yes
Unified Memory: No
Memory Models Flags: -gpu=mem:separate, -gpu=mem:managed
Default Target: cc120

So I guess

#pragma omp requires unified_address

would require the output of

/opt/nvidia/hpc_sdk/Linux_x86_64/2025/compilers/bin/nvaccelinfo | grep "Unified Addressing"

to be yes, and

#pragma omp requires unified_shared_memory

would require the output of

/opt/nvidia/hpc_sdk/Linux_x86_64/2025/compilers/bin/nvaccelinfo | grep "Unified Memory"

to be yes. So a compiler may implement omp requires without too much difficulty…
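To illustrate, a runtime check on the CUDA side could look roughly like the sketch below. I am assuming that cudaDevAttrUnifiedAddressing and cudaDevAttrPageableMemoryAccess are the attributes behind those two nvaccelinfo lines; that is my reading of the CUDA documentation, not something nvidia has confirmed to me.

#include <cuda_runtime_api.h>
#include <cstdio>

// Sketch: query the two device attributes that, as far as I understand,
// correspond to "Unified Addressing" and to pageable memory access
// (the property gcc uses for unified_shared_memory).
int main()
{
    int dev = 0, unified_addressing = 0, pageable_access = 0;

    cudaDeviceGetAttribute(&unified_addressing,
                           cudaDevAttrUnifiedAddressing, dev);
    cudaDeviceGetAttribute(&pageable_access,
                           cudaDevAttrPageableMemoryAccess, dev);

    std::printf("unified addressing:     %d\n", unified_addressing);
    std::printf("pageable memory access: %d\n", pageable_access);
    return 0;
}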

I put in a request, TPR #38008, and we'll see what engineering can do. Though the OpenMP team has a lot on their plate right now, so it might be a while.

Certainly; they are probably transitioning to OpenMP 6.

By the way, another very useful directive in OpenMP is the tile construct, which seems to speed up GPU computations considerably.

For small loops, there is the unroll construct (strides often have just 2 elements; unrolling is better there than starting threads)…

OpenMP also has a new loop construct, which is interesting.

In OpenMP 6.0, there are interesting features like omp_target_memset, omp_calloc, or omp_target_memset_async. At the hardware level this can, especially if it is done asynchronously and you have many such initializations, probably be faster than a

#pragma omp target teams distribute parallel for simd map(tofrom: array) map(to: upperbound)
for (int i = 0; i < upperbound; i++) array[i] = 0;

Additionally, there are routines to query device properties in OpenMP 6.0. In OpenMP 5, querying omp_get_num_teams returns the number of teams of the initial device (the processor); one needs to nest the call within a target region to query the GPU. Now there is an

omp_device_num_teams

function.

Also useful is a new

#pragma omp atomic compare

in OpenMP 6.0, which allows if clauses in atomic statements. Very useful for searching arrays for something.
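For example, a running maximum over an array can be written like this; a small sketch of how I read the specification, not something I could test with every compiler:

// Sketch: each thread only overwrites "best" if its candidate is larger,
// and the comparison plus the store happen as one atomic operation.
double running_max(const double *a, int n)
{
    double best = a[0];
    #pragma omp parallel for
    for (int i = 1; i < n; ++i)
    {
        double candidate = a[i];
        #pragma omp atomic compare
        if (best < candidate) { best = candidate; }
    }
    return best;
}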

And, by the way, nvidia has lent support to clang for its Blackwell implementation:

https://llvm.org/devmtg/2025-04/slides/technical_talk/ozen_blackwell.pdf

So it's a bit embarrassing that clang/llvm does not recognize

#pragma omp target simd

And that clang has this OpenMP problem with OpenMPI….

https://github.com/llvm/llvm-project/issues/162586

but probably nvidia has only implemented support at the machine code level, and the OpenMP team of llvm gets no nvidia support…

(For the OpenMPI problem: the OpenMPI developers suspect that clang registers the same host memory several times during CUDA initialization with MPI, which points to a problem in clang's CUDA initialization. But I still have to check whether the current git version of clang 22 has fixed this already.)

Even if OpenMP constructs like

#pragma omp target simd

are missing from clang 21, the Blackwell code generated by clang 21 is indeed surprisingly fast and copies memory asynchronously when the map pragmas and loops are used… which is what I would expect when people who really know the device have contributed support. I hope that

#pragma omp target simd

will be supported in llvm as soon as they have implemented support for

#acc vector

in clang, and that nvc++ will soon cover the OpenMP 6.0 standard to a high degree. In OpenMP 6.0, they really have thought about GPU devices… I also hope that gcc will support Blackwell soon (as of now gcc just miscompiles for my card and can't open device contexts)…

For now I will wait and test the new compilers when they arrive, so that I can work further with OpenMPI and optimize my code with OpenMP 6.0 features…

Best regards,

Benjamin

I somewhat disagree here. Not that “tile” can’t be useful, it can, but it’s highly unlikely most users would use it. OpenACC has had it since almost the beginning, and I have yet to see it used in any code except the toy examples I’d use for training.

My biggest issue with OpenMP is that it is huge, so huge that it’s impractical to implement it all. Engineering time is finite, so our team needs to prioritize features that will have the widest benefit. So things like tile, which are good features but would be rarely used, are given lower priority.

However, features like “loop” (which is essentially OpenACC) are highly impactful since they make code more performance portable and much easier to program, so we’ve supported it for several years. It’s typically what I suggest users use if they are using OpenMP to offload.

So it's a bit embarrassing that clang/llvm does not recognize

I wouldn’t be too hard on them. They too need to prioritize their time and they are primarily volunteers so their time may be further constrained.

Like here, put in your requests. If enough users request a feature, it may change the prioritization.


Hi Mat,

In my code I did not use tile because, when I tried it some time ago within my library with clang, I got errors. I now know that this was because clang had problems with its map directives and with tile, which are now fixed.

In the end this is a chicken-and-egg problem. To keep their applications portable, people won't use OpenMP pragmas if compilers don't support them yet or only added them recently.

In my opinion, one uses OpenMP or OpenACC to

a) improve performance and

b) because one wants to use something standardized that lets code run on various GPUs without needing adaptations for different models or brands.

So, if I were nvidia, I would first implement all, or as many as possible, of the parts of OpenMP that can increase performance. This includes loop constructs like unroll and tile, plus atomic update, capture, atomic compare, and perhaps masked; I don't know how much the new assume helps, that depends on the compiler.

Workshare and workdistribute are probably interesting for Fortran. Cancellation constructs may be nice for searching but difficult to implement on a GPU.

After the performance-improving constructs, the next priority would be routines for memory allocation, especially if they can improve speed.

It is probably slower to initialize an array to zero if I have to call omp_alloc and then write a teams distribute parallel for simd loop that starts threads, rather than calling omp_target_calloc or omp_target_memset, which can run code optimized for the hardware…
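A sketch of what I mean is below. I am assuming the OpenMP 6.0 prototype void *omp_target_memset(void *ptr, int val, size_t count, int device_num) as I read it in the specification; I have not yet been able to test it with any compiler:

#include <omp.h>
#include <cstddef>

void zero_device_buffer(std::size_t n, int dev)
{
    double *d = static_cast<double *>(
        omp_target_alloc(n * sizeof(double), dev));

    // Variant 1 (OpenMP 5.x): launch a kernel that writes the zeros itself.
    #pragma omp target teams distribute parallel for simd is_device_ptr(d) device(dev)
    for (std::size_t i = 0; i < n; ++i)
        d[i] = 0.0;

    // Variant 2 (OpenMP 6.0): let the runtime use whatever the hardware
    // offers, e.g. a dedicated (and possibly asynchronous) memset engine.
    omp_target_memset(d, 0, n * sizeof(double), dev);

    omp_target_free(d, dev);
}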

Last would come functions to query the hardware, like omp requires, omp_target_num_teams, omp_target_num_threads and so on.

In the end, you at nvidia must see that with clang and gcc there are compilers which implement many parts of the OpenMP standard and are already striving for compliance with OpenMP 6.

For a limited team with limited resources it may even be beneficial to support these efforts where necessary and possible, rather than trying to maintain your own compiler, because if that compiler then supports little of the standard, it is not competitive. So it depends on whether you have the resources to build your own compiler that can compete in that area with gcc and clang, which are large community efforts with many people involved.

Then, the OpenMP standard itself still needs some improvements.

Curiously, even though one can query the hardware better in OpenMP 6, OpenMP 6 does not yet have a function which can query how much memory is currently available on the GPU… That is a bit of a problem in the standard, because the map clauses don't return a device pointer that you could check against a null pointer.
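Right now, the only way I know to get that number on nvidia hardware is to go around OpenMP and ask the CUDA runtime directly, roughly like this sketch (which of course ties the code to CUDA):

#include <cuda_runtime_api.h>
#include <cstddef>
#include <cstdio>

// Sketch: query how much device memory is currently free, something the
// OpenMP runtime itself does not expose.
int main()
{
    std::size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess)
        std::printf("free: %zu of %zu bytes\n", free_bytes, total_bytes);
    return 0;
}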

Then, for C++, it is of course severe that OpenMP's declare mapper does not support templated structs (i.e. template<typename T> struct mystruct { T myval1; T myval2; }; cannot be given a mapper). Another problem is that the STL types are not "mappable types".
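For a concrete, non-templated struct, a user-defined mapper works fine; the sketch below (DoubleBlock is a made-up, non-templated stand-in for my DataBlock) is exactly what, to my knowledge, cannot currently be written once for template<typename T> struct DataBlock:

#include <cstddef>

struct DoubleBlock            // non-templated stand-in for DataBlock<double>
{
    double     *dpdata;
    std::size_t dpdatalength;
};

// Mapping a DoubleBlock also maps the payload its pointer refers to.
#pragma omp declare mapper(DoubleBlock b) map(b, b.dpdata[0:b.dpdatalength])

void scale_block(DoubleBlock blk, double s)
{
    #pragma omp target teams distribute parallel for map(tofrom: blk)
    for (std::size_t i = 0; i < blk.dpdatalength; ++i)
        blk.dpdata[i] *= s;
}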

At least for clang and gcc, managed mode does not seem to be a solution. As far as I know, for these compilers it works via page faults: a fault is caught, a single page is migrated to the GPU, then the next fault occurs, and so on until the array is resident. With clang and gcc, I noted a severe performance drop with this compared to mapping before a loop…

If the STL types were defined as mappable types in the standard, then one could simply offload an STL vector with omp target map. The way it is now, one has to work with pointers if one wants performance, but that is also OK of course…

A more interesting problem is this: the STL has a new mdspan type which supports not only static but also dynamic extents, giving it something of a Fortran flavour: https://en.cppreference.com/w/cpp/container/mdspan.html

Gcc and clang can now compile large parts of the STL to nvptx and are able to use some STL types natively within a target region (they just can't map them; they can only declare them as variables within a #pragma omp target {} region on the GPU). But in OpenMP, a #pragma omp teams distribute construct unfortunately has to be strictly nested inside a #pragma omp target construct, with no code between them.

This means that one can't map an array pointer to the GPU with omp target enter data map, then within a target region wrap it in an mdspan, and afterwards start an OpenMP teams distribute loop over the mdspan object as one would with a Fortran array…

Since teams distribute does not tolerate instructions between it and the enclosing target, only an OpenMP parallel for loop works with this pattern (see the sketch below), which is not optimal.

Changing this in the standard, so that a teams distribute construct may have instructions between it and its enclosing target construct, would allow very efficient array operations on the GPU with std::mdspan if mdspan is compiled to nvptx…
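For reference, the workaround I mean looks roughly like the following sketch; scale is only an illustration, and it assumes a standard library whose <mdspan> header the offload compiler can lower to nvptx, as described above:

#include <mdspan>
#include <cstddef>

// The raw pointer is mapped, the mdspan view is built *inside* the target
// region, and only a "parallel for" may follow, because "teams distribute"
// would have to be strictly nested in "target" with nothing in between.
void scale(double *a, std::size_t rows, std::size_t cols, double s)
{
    #pragma omp target map(tofrom: a[0:rows*cols])
    {
        std::mdspan<double, std::dextents<std::size_t, 2>> m(a, rows, cols);

        #pragma omp parallel for collapse(2)
        for (std::size_t i = 0; i < m.extent(0); ++i)
            for (std::size_t j = 0; j < m.extent(1); ++j)
                m[i, j] *= s;   // C++23 multidimensional subscript
    }
}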

Hm, I tested the new HPC SDK 25.11.

Sadly, when feeding my code into it, I still get these errors:

"/home/benni/projects/arraylibrary/openmp/indiceshelperfunctions.h", line 43: error: invalid text in pragma
#pragma omp unroll partial

“/home/benni/projects/arraylibrary/openmp/datablock.h”, line 12: error: invalid text in pragma
#pragma omp requires unified_address

“/home/benni/projects/arraylibrary/openmp/datablock.h”, line 27: error: invalid text in pragma
#pragma omp unroll partial

“/home/benni/projects/arraylibrary/openmp/datablock.h”, line 37: error: invalid text in pragma
#pragma omp unroll

“/home/benni/projects/arraylibrary/openmp/mdspan_omp.h”, line 634: error: invalid text in pragma
#pragma omp unroll partial

I know that 25.11 came out shortly after my post, which was too little time to implement things like omp unroll or omp requires. But I read in the documentation that at least omp requires should not break compilation; it still does. That is why I mention it here…

In the case of omp requires, I think I can add a macro especially for nvc++ (although it is a bit silly, because the conditions for unified_address, which just require pointer arithmetic and forbid address overlaps, are true even for an old GTX 1060 Ti Super from 10 years ago, according to nvidia's nvaccelinfo in the HPC SDK).

With the unroll and unroll partial directives, however, replacing the OpenMP constructs with a macro is not what I want to do.

Since strides can have any number of dimensions, I have loops over them, and in the case of matrices they are rather small. It is reasonable that the programmer determines which loop should get unrolled, which is what this pragma allows on a per-loop basis. nvc++ does this with a command line parameter, which then applies to all loops…

So that means either not using the unroll directive at all; in that case, if I then compiled with clang, the code would become unreasonably slow.

Or it means that I have to replace

#pragma omp unroll partial

with an ugly preprocessor macro that first checks for nvc++ and which translates to

#pragma omp unroll partial

for clang and gcc and to nothing for nvc++.
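Concretely, such a macro would look roughly like this sketch; OMP_UNROLL_PARTIAL is just a name I made up, and I am assuming that nvc++ defines __NVCOMPILER, as the NVHPC compilers document:

// Drop the directive for nvc++, keep it for compilers that understand it.
#if defined(__NVCOMPILER)
  #define OMP_UNROLL_PARTIAL
#else
  #define OMP_UNROLL_PARTIAL _Pragma("omp unroll partial")
#endif

// usage, e.g. on a small loop over the strides:
// OMP_UNROLL_PARTIAL
// for (size_t d = 0; d < rank; ++d)
//     offset += idx[d] * strides[d];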

I think the unroll directive is useful.

The programmer should decide which loops to unroll, and this should be done on a case-by-case basis, which is what this pragma allows.

I don't like it when I have to write macros to keep useful parts of the OpenMP standard away from a compiler.

But of course it may take more time. 25.11 came out shortly after my post and that is certainly too fast to implement such things.

So I will wait for the next HPC SDK releases and test again.

I am also looking forward to seeing how my CUDA-aware MPI algorithms work with it and how the code compares to that generated by clang in different circumstances…