Hi, when I try to compile my library (the source of which is at) with nvc++, the build fails because nvc++ raises an error whenever it encounters a #pragma omp unroll. It could simply ignore the directive, but instead it stops compilation.
Also, for small loops, unroll, like the tile directive, is quite beneficial, so nvc++ should really support it.
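For reference, a minimal sketch of the kind of loop I mean (this helper is just an illustration of mine, not code from the library), which nvc++ refuses to build:

#include <cstddef>
// Small fixed-trip-count loop where unrolling is typically worthwhile;
// nvc++ stops with an error at the unroll directive instead of ignoring it.
void scale3(double* v, double factor)
{
    #pragma omp unroll
    for (std::size_t i = 0; i < 3; ++i)
        v[i] *= factor;
}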
My library uses OpenMP target offloading together with the Message Passing Interface (Open MPI) to work on clusters. If I compile the library with clang, it works, except for the following:
Any application compiled with clang that uses GPU offloading together with Open MPI and is run with mpirun -np 8 ./a.out will yield
The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
You may test it with something as simple as the following main.cpp:
#include <omp.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int process_Rank, size_Of_Cluster;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);
    printf("Hello World from process %d of %d\n", process_Rank, size_Of_Cluster);
    MPI_Finalize();
    return 0;
}
Now set
export OMPI_CXX=clang++
Then mpicxx --show
will yield
clang++ -Wl,-rpath -Wl,/usr/lib64 -Wl,--enable-new-dtags -lmpi
Now compile the main.cpp with
mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
which will then just call clang++ with the above parameters.
Then run with
mpirun -np 8 ./a.out
will then yield the above error.
After that, however, the program will (usually) proceed as normal. Since NVIDIA is sometimes involved in the development of LLVM and Clang, can you do something to fix that? Open MPI and CUDA are an important combination.
Also, Clang currently does not map OpenMP simd on the GPU to the equivalent of an OpenACC vector, as gcc now does.
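To illustrate what I mean (a sketch of mine, not library code): in a loop like the following, gcc can, as far as I understand, map the simd level to the GPU's vector lanes similarly to an OpenACC vector clause, while clang currently treats the simd level on nvptx essentially as a no-op:

// Offloaded saxpy where the simd level could be mapped to vector lanes.
void saxpy_offload(const float* x, float* y, float a, int n)
{
    #pragma omp target teams distribute parallel for simd map(to:x[0:n]) map(tofrom:y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}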
Unfortunately, when I compile my library with gcc (the latest development branch), I get strange memory errors from libgomp when executing the programs sparsetests and mathdemonstrations on my Blackwell RTX 5060 Ti card.
I posted this at gcc's Bugzilla. The gcc developers who work on the OpenMP code for accelerators are Thomas Schwinge and Tobias Burnus. They say, however, that on their graphics cards my code works, and they cannot test it for my chip because they have no Blackwell card at hand. Gcc currently only compiles for sm_89, but that should not be a problem, since CUDA should be backward compatible; I don't know for sure, though.
Interestingly, the code for sparsetests and mathdemonstrations works when compiled for offload with clang, and I can see the cuda kernels running without errors.
What is more, with gcc, if I use collapse(2) on the first two loops of a matrix multiplication, my RTX 5060 Ti produces nonsensical values; Clang compiles the same code fine, however.
The following simple test program executes a few matrix multiplications: single threaded on the host, multi-threaded on the host with collapse(2) for the first two loops, on the GPU with target teams distribute for the first loop followed by parallel for for the second loop, and finally on the GPU with target teams distribute parallel for collapse(2) for the first two loops. With clang, the results of all the multiplications always agree. At the gcc forum, they say that for their cards this is also true.
For my Blackwell, with gcc, I always get different results for the collapse(2) version of the matrix multiplication.
The question is whether this is a problem of gcc, of libcuda, or whether my hardware is defective. I use the latest Linux kernel, 6.17.8.
Also interesting is what happens if I use valgrind on the GPU code. Of course valgrind cannot access GPU pointers and so should report lots of inaccessible memory. It does, which is fine, but it also reports things like this:
==17772== 16 bytes in 1 blocks are definitely lost in loss record 10 of 77
==17772== at 0x490F8D8: malloc (vg_replace_malloc.c:447)
==17772== by 0x11C36E47: ??? (in /usr/lib64/libcuda.so.580.105.08)
==17772== by 0x11D26356: ??? (in /usr/lib64/libcuda.so.580.105.08)
==17772== by 0x11C22C22: ??? (in /usr/lib64/libcuda.so.580.105.08)
==17772== by 0x52875BE: start_thread (in /usr/lib64/libc.so.6)
==17772== by 0x531A283: clone (in /usr/lib64/libc.so.6)
The question is whether this is a memory problem in libcuda, since the code below, which triggered it, does not call malloc on its own.
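To check whether the leak reports really come from libcuda rather than from my code, a minimal sketch of mine that does nothing except initialize the offload runtime could be run under valgrind; if the same libcuda frames appear in the loss records, my code is not the allocator:

#include <omp.h>
// Touch the offload runtime once and do nothing else; run under valgrind
// and compare the loss records with the ones reported above.
int main()
{
    int x = 0;
    #pragma omp target map(tofrom:x)
    x = 1;
    return (x == 1) ? 0 : 1;
}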
Below are the matrix multiplications. Compiled with clang and the options
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall
there is no output, as the results always agree. Compiled with gcc-15 or gcc-16 (git version) and the options
-fopenmp -foffload=nvptx-none -foffload-options=nvptx-none=-march=sm_89 -fno-stack-protector
I always get the report that the matrix multiplication with collapse(2) differs from the single-threaded one. Perhaps you can analyze the CUDA code and see what goes wrong there, either in the compiler, in libgomp, in libcuda, in the GPU driver, or on the Blackwell card itself.
Currently I use driver 580.95.05 and nvidia-cuda-toolkit-12.9.1-r1 on Gentoo, since CUDA 13 does not work with a lot of software yet and the newest driver (.108) was downgraded due to regressions. I did, however, also try the newest driver and cuda-toolkit 13: the memory problems when compiling my library with gcc, the MPI problems when compiling with clang, and the matrix multiplication problems with collapse(2) on gcc were also present with cuda-toolkit 13 and the .108 version of the nvidia-drivers on my NVIDIA GeForce RTX 5060 Ti (CUDA arch sm_120) under kernel 6.17.8. My CPU is an AMD Ryzen 9 3900X 12-core processor (a bit old, I know) with 32 GB RAM.
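Before the reproducer, here is a small sanity check of mine (not part of the test program below) that I use to make sure the target regions really run on the GPU and do not silently fall back to the host, which would otherwise hide compiler or driver problems:

#include <omp.h>
#include <stdio.h>
// Prints the available offload devices and whether a target region actually
// executes on the device or falls back to the host.
void check_offload()
{
    printf("devices: %d, default device: %d\n",
           omp_get_num_devices(), omp_get_default_device());
    int on_host = 1;
    #pragma omp target map(from:on_host)
    on_host = omp_is_initial_device();
    printf("target region ran on %s\n", on_host ? "the host (fallback!)" : "the gpu");
}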
#include <omp.h>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
template<typename T>
class DataBlock
{
public:
size_t* dpextents;
size_t* dpstrides;
T* dpdata;
size_t dpdatalength;
DataBlock(size_t *ext,size_t *str, T *dat, size_t datlength):dpextents(ext),dpstrides(str),dpdata(dat),dpdatalength(datlength) {}
};
template<typename T>
void printmatrix(DataBlock<T> &C)
{
for (size_t i=0; i<C.dpextents[0]; i++)
{
for(size_t j=0; j<C.dpextents[1]; j++)
cout<< C.dpdata[i*C.dpstrides[0]+j*C.dpstrides[1]]<< " ";
cout<< "\n";
}
cout<<"\n\n";
}
template <typename T>
void matrix_multiply_dot_g( const DataBlock<T>& A, const DataBlock<T>& B, DataBlock<T>& C,int dev)
{
const size_t Astr0=A.dpstrides[0];
const size_t Astr1=A.dpstrides[1];
const size_t Bstr0=B.dpstrides[0];
const size_t Bstr1=B.dpstrides[1];
const size_t Cstr0=C.dpstrides[0];
const size_t Cstr1=C.dpstrides[1];
const size_t rows=A.dpextents[0];
const size_t cols=B.dpextents[1];
const size_t inner_dim=A.dpextents[1];
#pragma omp target enter data map(to:A,A.dpdata[0:A.dpdatalength])device(dev)
#pragma omp target enter data map(to:B,B.dpdata[0:B.dpdatalength])device(dev)
#pragma omp target enter data map(to:C,C.dpdata[0:C.dpdatalength])device(dev)
#pragma omp target teams distribute device(dev)
for (size_t i = 0; i < rows; ++i)
#pragma omp parallel for
for (size_t j = 0; j < cols; ++j)
{
T sum = T(0);
for (size_t k = 0; k < inner_dim; ++k)
{
sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
}
C.dpdata[i*Cstr0+j*Cstr1]= sum;
}
#pragma omp target update from (C.dpdata[0:C.dpdatalength])device(dev)
#pragma omp target exit data map(release:C.dpdata[0:C.dpdatalength],C)device(dev)
#pragma omp target exit data map(release:A.dpdata[0:A.dpdatalength],A)device(dev)
#pragma omp target exit data map(release:B.dpdata[0:B.dpdatalength],B)device(dev)
}
template <typename T>
void matrix_multiply_dot_g_with_collapse( const DataBlock<T>& A, const DataBlock<T>& B, DataBlock<T>& C,int dev)
{
const size_t Astr0=A.dpstrides[0];
const size_t Astr1=A.dpstrides[1];
const size_t Bstr0=B.dpstrides[0];
const size_t Bstr1=B.dpstrides[1];
const size_t Cstr0=C.dpstrides[0];
const size_t Cstr1=C.dpstrides[1];
const size_t rows=A.dpextents[0];
const size_t cols=B.dpextents[1];
const size_t inner_dim=A.dpextents[1];
#pragma omp target enter data map(to:A,A.dpdata[0:A.dpdatalength])device(dev)
#pragma omp target enter data map(to:B,B.dpdata[0:B.dpdatalength])device(dev)
#pragma omp target enter data map(to:C,C.dpdata[0:C.dpdatalength])device(dev)
#pragma omp target teams distribute parallel for collapse(2) device(dev)
for (size_t i = 0; i < rows; ++i)
for (size_t j = 0; j < cols; ++j)
{
T sum = T(0);
for (size_t k = 0; k < inner_dim; ++k)
{
sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
}
C.dpdata[i*Cstr0+j*Cstr1]= sum;
}
#pragma omp target update from (C.dpdata[0:C.dpdatalength])device(dev)
#pragma omp target exit data map(release:C.dpdata[0:C.dpdatalength],C)device(dev)
#pragma omp target exit data map(release:A.dpdata[0:A.dpdatalength],A)device(dev)
#pragma omp target exit data map(release:B.dpdata[0:B.dpdatalength],B)device(dev)
}
template <typename T>
void matrix_multiply_dot_w( const DataBlock<T>& A, const DataBlock<T>& B, DataBlock<T>& C)
{
const size_t rows=A.dpextents[0];
const size_t cols=B.dpextents[1];
const size_t inner_dim=A.dpextents[1];
const size_t Astr0=A.dpstrides[0];
const size_t Astr1=A.dpstrides[1];
const size_t Bstr0=B.dpstrides[0];
const size_t Bstr1=B.dpstrides[1];
const size_t Cstr0=C.dpstrides[0];
const size_t Cstr1=C.dpstrides[1];
#pragma omp parallel for collapse(2)
for (size_t i = 0; i < rows; ++i)
{
for (size_t j = 0; j < cols; ++j)
{
T sum =T(0);
for (size_t k = 0; k < inner_dim; ++k)
{
sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
}
C.dpdata[i*Cstr0+j*Cstr1]= sum;
}
}
}
template <typename T>
void matrix_multiply_dot_s( const DataBlock<T>& A, const DataBlock<T>& B, DataBlock<T>& C)
{
const size_t rows=A.dpextents[0];
const size_t cols=B.dpextents[1];
const size_t inner_dim=A.dpextents[1];
const size_t Astr0=A.dpstrides[0];
const size_t Astr1=A.dpstrides[1];
const size_t Bstr0=B.dpstrides[0];
const size_t Bstr1=B.dpstrides[1];
const size_t Cstr0=C.dpstrides[0];
const size_t Cstr1=C.dpstrides[1];
for (size_t i = 0; i < rows; ++i)
{
for (size_t j = 0; j < cols; ++j)
{
T sum =T(0);
for (size_t k = 0; k < inner_dim; ++k)
{
sum += A.dpdata[i*Astr0+k*Astr1] *B.dpdata[k*Bstr0+j*Bstr1];
}
C.dpdata[i*Cstr0+j*Cstr1]= sum;
}
}
}
int main(int argc, char** argv)
{
size_t rows = 12, cols = 12;
vector<double> A_data=
{
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1,
2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11,
11, 9, 7, 5, 3, 1, 12, 10, 8, 6, 4, 2,
3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10,
10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3,
4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9,
9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10,
5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12,
12, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5,
6, 1, 8, 3, 10, 5, 12, 7, 2, 9, 4, 11,
11, 2, 9, 4, 12, 7, 3, 10, 5, 1, 8, 6
};
vector<double> B_data=
{
12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10,
10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3,
5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 12,
12, 9, 6, 3, 10, 7, 4, 1, 8, 5, 2, 11,
2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11,
11, 8, 5, 2, 9, 6, 3, 12, 7, 4, 1, 10,
3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, 10,
10, 7, 4, 1, 11, 8, 5, 2, 12, 9, 6, 3,
4, 8, 12, 3, 7, 11, 2, 6, 10, 1, 5, 9,
9, 5, 1, 7, 3, 11, 8, 4, 12, 6, 2, 10
};
size_t i=0;
do
{
vector<double>C0_data(rows*cols,0);
vector<size_t> rowmajorstrides= {cols,1};
vector<size_t> extents= {rows,cols};
DataBlock<double> A= {extents.data(),rowmajorstrides.data(),A_data.data(),12*12};
DataBlock<double> B= {extents.data(),rowmajorstrides.data(),B_data.data(),12*12};
DataBlock<double> C0= {extents.data(),rowmajorstrides.data(),C0_data.data(),12*12};
matrix_multiply_dot_s(A,B,C0);
vector<double>C1_data(rows*cols,0);
DataBlock<double> C1= {extents.data(),rowmajorstrides.data(),C1_data.data(),12*12};
matrix_multiply_dot_w(A,B,C1);
vector<double>C2_data(rows*cols,0);
DataBlock<double> C2= {extents.data(),rowmajorstrides.data(),C2_data.data(),12*12};
matrix_multiply_dot_g(A,B,C2,omp_get_default_device());
vector<double>C3_data(rows*cols,0);
DataBlock<double> C3= {extents.data(),rowmajorstrides.data(),C3_data.data(),12*12};
matrix_multiply_dot_g_with_collapse(A,B,C3,omp_get_default_device());
i++;
if(C0_data!=C1_data)
{
cout <<"A\n";
printmatrix(A);
cout<<"B\n";
printmatrix(B);
cout<<"multiplication of A and B: result0 (single threaded)!= result2 (parallel for collapse(2) at attempt="<<i <<"\n";
printmatrix(C0);
printmatrix(C1);
break;
}
if(C0_data!=C2_data)
{
cout <<"A\n";
printmatrix(A);
cout<<"B\n";
printmatrix(B);
cout<<"multiplication of A and B: result1 (single threaded)!= result2(target teams distribute and parallel for in separate loop) at attempt="<<i<<"\n";
printmatrix(C0);
printmatrix(C2);
break;
}
if(C0_data!=C3_data)
{
cout <<"A\n";
printmatrix(A);
cout<<"B\n";
printmatrix(B);
cout<<"multiplication of A and B: result1 (single threaded) != result3 (target teams distribute parallel for collapse(2)) at attempt="<<i<<"\n";
printmatrix(C0);
printmatrix(C3);
break;
}
}
while (i<200);
return 0;
}
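In case it helps to narrow things down, here is a workaround sketch of mine (not part of the reproducer above): flattening the two outer loops by hand into a single index avoids the collapse(2) clause on the target construct entirely while distributing the same work. If this version gives correct results with gcc on my card, the collapse lowering would be the likely culprit.

// Same computation as matrix_multiply_dot_g_with_collapse, but the i/j loops
// are flattened manually, so no collapse(2) clause is needed on the target
// construct.
template <typename T>
void matrix_multiply_dot_g_flat(const DataBlock<T>& A, const DataBlock<T>& B, DataBlock<T>& C, int dev)
{
    const size_t Astr0=A.dpstrides[0], Astr1=A.dpstrides[1];
    const size_t Bstr0=B.dpstrides[0], Bstr1=B.dpstrides[1];
    const size_t Cstr0=C.dpstrides[0], Cstr1=C.dpstrides[1];
    const size_t rows=A.dpextents[0], cols=B.dpextents[1], inner_dim=A.dpextents[1];
    #pragma omp target enter data map(to:A,A.dpdata[0:A.dpdatalength]) device(dev)
    #pragma omp target enter data map(to:B,B.dpdata[0:B.dpdatalength]) device(dev)
    #pragma omp target enter data map(to:C,C.dpdata[0:C.dpdatalength]) device(dev)
    #pragma omp target teams distribute parallel for device(dev)
    for (size_t ij = 0; ij < rows*cols; ++ij)
    {
        const size_t i = ij / cols;
        const size_t j = ij % cols;
        T sum = T(0);
        for (size_t k = 0; k < inner_dim; ++k)
            sum += A.dpdata[i*Astr0+k*Astr1] * B.dpdata[k*Bstr0+j*Bstr1];
        C.dpdata[i*Cstr0+j*Cstr1] = sum;
    }
    #pragma omp target update from(C.dpdata[0:C.dpdatalength]) device(dev)
    #pragma omp target exit data map(release:C.dpdata[0:C.dpdatalength],C) device(dev)
    #pragma omp target exit data map(release:A.dpdata[0:A.dpdatalength],A) device(dev)
    #pragma omp target exit data map(release:B.dpdata[0:B.dpdatalength],B) device(dev)
}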
I hope this gets resolved soon. I want to substantially extend the message-passing part of the library so that it can be used well on GPU clusters.