Hi, I have researched this dot product and the compiler a bit more.
I do not know whether nvc++ has a bug reporting form, like Bugzilla or similar.
But what I found may be of interest for those who can improve nvc++, so I am posting it here.
It would be great if these errors either turned out to be on my side or, if they are confirmed issues, led to actual fixes in the compiler…
This call here:
T norm = sqrt(gpu_dot_product_s(v,v));
is made in my code inside a loop, but one which is not parallelizable and therefore carries no acc pragma; the dot product call sits at line 1955 of the file.
As I said, if I switch this to the vectorized version, I get a segfault when compiling with -O1 or without any -O flag…
One may think this is because of the strides variable in operator()(size_t row) of datastruct, which the dot product uses. Indeed, for vectorization this additional strides field is a bit problematic. Normally one would declare it constant after construction, or something similar. But this is a struct that can be offloaded to and updated by a GPU. If I declare these members constant before offloading, I get mapping errors, which are well deserved in that case. Yet in the offloaded functions I can apparently declare the arguments constant; it does not seem to matter to nvc++…
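For context, here is a minimal sketch of the members the dot product touches. The field names are the ones used in my code below; everything else (member order, the exact constructor set) is simplified for illustration:

template<typename T>
struct datastruct
{
    T*      pdata;        // flat data buffer, may live on the device
    size_t* pextents;     // extent per dimension
    size_t* pstrides;     // stride per dimension, in elements
    size_t  pdatalength;  // total number of elements in pdata
    size_t  prank;        // number of dimensions

    #pragma acc routine seq
    T& operator()(size_t row)             // 1D access, goes through pstrides[0]
    {
        return pdata[row * pstrides[0]];
    }
};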
Now, let's circumvent the strides issue and the operator() for a moment and write two free functions:
#pragma acc routine seq
template<typename T>
inline T gpu_dot_product_s(const datastruct<T> vec1, const datastruct<T> vec2)
{
    const size_t n = vec1.pextents[0];
    T result = 0;
    const size_t v1s = vec1.pstrides[0];
    const size_t v2s = vec2.pstrides[0];
    //#pragma acc loop reduction(+:result)
    for (size_t i = 0; i < n; ++i)
    {
        result += vec1.pdata[i * v1s] * vec2.pdata[i * v2s];
    }
    return result;
}
#pragma acc routine vector
template<typename T>
inline T gpu_dot_product_v(const datastruct<T> vec1, const datastruct<T> vec2)
{
    const size_t n = vec1.pextents[0];
    T result = 0;
    const size_t v1s = vec1.pstrides[0];
    const size_t v2s = vec2.pstrides[0];
    //#pragma acc loop vector reduction(+:result)
    for (size_t i = 0; i < n; ++i)
    {
        result += vec1.pdata[i * v1s] * vec2.pdata[i * v2s];
    }
    return result;
}
Despite containing the same code, the function designated as vector fails when used at line 1955.
One can promote gpu_qr_decomposition to a gang routine; a worker dot product fails there too, as does a vector dot product.
The compiler simply refuses anything but seq here. Interestingly, this is not the case for the matrix multiplications, which are also called: they can even be workers at -O1, and they keep working even if gpu_qr_decomposition is itself a gang or a worker routine.
With the sequential dot product, a problem appears only at -O2.
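To make the calling pattern concrete, here is a stripped-down reproducer sketch of the structure I mean. This is not my actual QR code; the caller name and loop body are placeholders, only gpu_dot_product_s and gpu_matrix_multiply_dot_w are real:

#include <cmath>

// Hypothetical caller with the same shape as gpu_qr_decomposition at -O1.
#pragma acc routine gang
template<typename T>
void caller_like_qr(datastruct<T> v, datastruct<T>& M, datastruct<T>& R, size_t steps)
{
    for (size_t j = 0; j < steps; ++j)              // inherently sequential: step j depends on step j-1
    {
        T norm = sqrt(gpu_dot_product_s(v, v));     // seq version: fine; gpu_dot_product_v here: broken build
        v.pdata[0] = norm;                          // fake dependency, so this loop cannot be parallelized
    }
    gpu_matrix_multiply_dot_w(M, M, R);             // a worker routine: accepted at -O1 even from this caller
}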
Say I promote gpu_qr_decomposition into a gang function and use -O2. Then the worker matrix multiplications would become legal. However, the following compile failure happens:
537, Complex loop carried dependence of .inl__T2662_19983_16827->pdata->,-> prevents parallelization
Parallelization requires privatization of -> as well as last value
NVC++-W-1052-Reductions on a gang loop in acc routine are not supported (/home/benni/projects/arraylibrary/main_acc.cpp: 328)
NVC++-W-1052-Reductions on a gang loop in acc routine are not supported (/home/benni/projects/arraylibrary/main_acc.cpp: 274)
NVC+±W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Load of NULL symbol (/home/benni/projects/arraylibrary/main_acc.cpp: 1944)
void gpu_qr_decomposition(datastruct&, datastruct, datastruct&, double*, unsigned long):
The problem with this is that there is a constructor at line 274 which contains a loop.
But I wrote:
#pragma acc routine seq
template<typename T> datastruct<T>::datastruct(
So at -O2 the compiler basically treats the programmer as stupid and ignores it when I explicitly declare a function as sequential…
This is unfortunate, because if it ignores pragmas that say a loop should not be parallelized, parallelizes it anyway at -O2, and then fails, I cannot use -O2 and benefit from the other optimizations it might bring…
So let's stick with -O1 for now.
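One experiment that might still be worth a try (a sketch only; the constructor body is schematic, and I have not verified that it changes anything at -O2) is to mark the loop inside the constructor itself as seq, instead of relying on the routine-level designation alone:

// Schematic: my real constructor takes more arguments. The point is the explicit
// loop-level pragma in addition to the routine-level one.
#pragma acc routine seq
template<typename T>
datastruct<T>::datastruct(T* data, size_t rank, size_t* extents, size_t* strides)
    : pdata(data), pextents(extents), pstrides(strides), pdatalength(1), prank(rank)
{
    #pragma acc loop seq   // explicitly forbid parallelization of this loop
    for (size_t i = 0; i < rank; ++i)
        pdatalength *= extents[i];
}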
Also interesting: if one leaves the parallelism level out of gpu_qr_decomposition and just declares it a plain routine, nvc++ fails instead of ignoring the loop pragmas. That is fine.
But if I promote gpu_qr_decomposition into a gang routine and use -O1, one might assume that I could then issue statements like
#pragma acc parallel loop
for (size_t i=0; i<R.pdatalength; i++)
{
    R.pdata[i]=0;
}
instead of the #pragma acc loop vector. But no. If I write
#pragma acc routine gang
template<typename T>
void gpu_qr_decomposition
then the compilation of that parallel loop ends with the following abort:
e -Minfo=accel -std=c++23 -std=c++20 -acc -gpu=cuda12.6 -Minfo=accel -MD -MT CMakeFiles/arraytest.dir/main_acc.cpp.o -MF CMakeFiles/arraytest.dir/main_acc.cpp.o.d -o CMakeFiles/arraytest.dir/main_acc.cpp.o -c /home/benni/projects/arraylibrary/main_acc.cpp
compute_offset(unsigned long const*, unsigned long*, unsigned long, bool):
14, include "mdspan_acc.h"
133, Generating NVIDIA GPU code
140, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:offset)
142, Vector barrier inserted for vector loop reduction
149, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:offset)
151, Vector barrier inserted for vector loop reduction
140, Loop is parallelizable
149, Loop is parallelizable
compute_offset(unsigned long, unsigned long, unsigned long, unsigned long):
14, include "mdspan_acc.h"
166, Generating acc routine seq
Generating NVIDIA GPU code
compute_data_length(unsigned long const, unsigned long const*, unsigned long):
14, include "mdspan_acc.h"
201, Generating acc routine seq
Generating NVIDIA GPU code
fill_strides(unsigned long const*, unsigned long*, unsigned long, bool):
14, include "mdspan_acc.h"
281, Generating acc routine seq
Generating NVIDIA GPU code
bool matrix_multiply_dot<double, std::vector<unsigned long, std::allocator>, std::array<unsigned long, 2ul>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::array<unsigned long, 2ul>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, bool):
14, include "mdspan_acc.h"
3601, Generating enter data copyin(dB.pstrides[:dB.prank],dC)
Generating enter data create(dC.pdata[:dC.pdatalength])
Generating enter data copyin(dC.pextents[:dC.prank],dA.pstrides[:dA.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],cols,dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],rows,inner_dim,dC.pstrides[:dC.prank])
Generating present(cols,dA,dB,dC,rows,inner_dim)
Generating NVIDIA GPU code
3621, #pragma acc loop gang collapse(2) /* blockIdx.x */
3623, /* blockIdx.x collapsed */
3627, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:sum)
3627, Loop is parallelizable
3648, Generating update self(dC.pdata[:dC.pdatalength])
Generating exit data delete(dC,dC.pdata[:dC.pdatalength],dC.pextents[:dC.prank],dB.pstrides[:dB.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],dA.pstrides[:dA.prank],dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],cols,rows,inner_dim,dC.pstrides[:dC.prank])
void cholesky_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters, unsigned long, bool):
14, include "mdspan_acc.h"
3089, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dL)
Generating enter data create(dL.pdata[:dL.pdatalength])
Generating enter data copyin(dL.pextents[:dL.prank],step_size,dL.pstrides[:dL.prank])
Generating present(dA,step_size,dL)
Generating NVIDIA GPU code
3104, Generating update self(dL.pdata[:dL.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dL,dL.pdata[:dL.pdatalength],dL.pextents[:dL.prank],dA.pstrides[:dA.prank],step_size,dL.pstrides[:dL.prank])
void lu_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters&, unsigned long, bool):
14, include "mdspan_acc.h"
3247, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dL)
Generating enter data create(dL.pdata[:dL.pdatalength])
Generating enter data copyin(dL.pextents[:dL.prank],dL.pstrides[:dL.prank],dU)
Generating enter data create(dU.pdata[:dU.pdatalength])
Generating enter data copyin(dU.pextents[:dU.prank],dU.pstrides[:dU.prank],step_size)
Generating present(dA,dL,dU,step_size)
Generating NVIDIA GPU code
3267, Generating update self(dU.pdata[:dU.pdatalength],dL.pdata[:dL.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dL,dL.pdata[:dL.pdatalength],dL.pextents[:dL.prank],dA.pstrides[:dA.prank],dU,dU.pdata[:dU.pdatalength],dU.pextents[:dU.prank],dL.pstrides[:dL.prank],step_size,dU.pstrides[:dU.prank])
void qr_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters, unsigned long, bool):
14, include "mdspan_acc.h"
3399, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dQ)
Generating enter data create(dQ.pdata[:dQ.pdatalength])
Generating enter data copyin(dQ.pextents[:dQ.prank],dQ.pstrides[:dQ.prank],dR)
Generating enter data create(dR.pdata[:dR.pdatalength])
Generating enter data copyin(dR.pextents[:dR.prank],dR.pstrides[:dR.prank],step_size)
Generating present(dA,dQ,dR,step_size)
Generating NVIDIA GPU code
CUDA shared memory used for _T438_19957
3418, Generating update self(dR.pdata[:dR.pdatalength],dQ.pdata[:dQ.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dQ,dQ.pdata[:dQ.pdatalength],dQ.pextents[:dQ.prank],dA.pstrides[:dA.prank],dR,dR.pdata[:dR.pdatalength],dR.pextents[:dR.prank],dQ.pstrides[:dQ.prank],step_size,dR.pstrides[:dR.prank])
bool strassen_multiply<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters const&):
14, include "mdspan_acc.h"
bool winograd_multiply<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters const&):
14, include "mdspan_acc.h"
void gpu_cholesky_decomposition(datastruct&, datastruct&, double, unsigned long):
14, include "mdspan_acc.h"
1726, Generating NVIDIA GPU code
1759, #pragma acc loop vector /* threadIdx.x */
1768, #pragma acc loop seq
1787, #pragma acc loop worker, vector collapse(2) /* threadIdx.y threadIdx.x */
1789, /* threadIdx.y threadIdx.x collapsed */
1801, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
Generating reduction(+:temp)
1812, #pragma acc loop worker /* threadIdx.y */
1816, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:temp2)
1818, Vector barrier inserted for vector loop reduction
1759, Loop is parallelizable
1787, Loop is parallelizable
1789, Loop is parallelizable
1801, Loop is parallelizable
1812, Loop is parallelizable
1816, Loop is parallelizable
void gpu_matrix_multiply_dot_w(datastruct&, datastruct&, datastruct&):
14, include "mdspan_acc.h"
1655, Generating NVIDIA GPU code
1661, #pragma acc loop worker collapse(2) /* threadIdx.y */
1663, /* threadIdx.y collapsed */
1667, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:sum)
1669, Vector barrier inserted for vector loop reduction
1661, Loop is parallelizable
1663, Loop is parallelizable
1667, Loop is parallelizable
bool matrix_multiply_dot<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, bool):
14, include "mdspan_acc.h"
3601, Generating enter data copyin(dB.pstrides[:dB.prank],dC)
Generating enter data create(dC.pdata[:dC.pdatalength])
Generating enter data copyin(dC.pextents[:dC.prank],dA.pstrides[:dA.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],cols,dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],rows,inner_dim,dC.pstrides[:dC.prank])
Generating present(cols,dA,dB,dC,rows,inner_dim)
Generating NVIDIA GPU code
3621, #pragma acc loop gang collapse(2) /* blockIdx.x */
3623, /* blockIdx.x collapsed */
3627, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:sum)
3627, Loop is parallelizable
3648, Generating update self(dC.pdata[:dC.pdatalength])
Generating exit data delete(dC,dC.pdata[:dC.pdatalength],dC.pextents[:dC.prank],dB.pstrides[:dB.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],dA.pstrides[:dA.prank],dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],cols,rows,inner_dim,dC.pstrides[:dC.prank])
void gpu_lu_decomposition(datastruct&, datastruct&, datastruct&, double, unsigned long):
14, include "mdspan_acc.h"
1839, Generating NVIDIA GPU code
1869, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1877, #pragma acc loop seq
1895, #pragma acc loop worker, vector collapse(2) /* threadIdx.y threadIdx.x */
1897, /* threadIdx.y threadIdx.x collapsed */
1906, #pragma acc loop worker /* threadIdx.y */
1910, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:temp)
1912, Vector barrier inserted for vector loop reduction
1918, #pragma acc loop worker /* threadIdx.y */
1922, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:temp)
1924, Vector barrier inserted for vector loop reduction
1869, Loop is parallelizable
1895, Loop is parallelizable
1897, Loop is parallelizable
1906, Loop is parallelizable
1910, Loop is parallelizable
1918, Loop is parallelizable
1922, Loop is parallelizable
NVC++-S-1065-Unsupported nested compute construct in compute construct or acc routine (/home/benni/projects/arraylibrary/main_acc.cpp: 1988)
double gpu_dot_product_s(datastruct, datastruct):
14, include "mdspan_acc.h"
2383, Generating acc routine seq
Generating NVIDIA GPU code
double dot_product<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&):
14, include "mdspan_acc.h"
bool matrix_add<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&):
14, include "mdspan_acc.h"
bool matrix_subtract<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&):
14, include "mdspan_acc.h"
datastruct::datastruct(double, unsigned long, bool, unsigned long, unsigned long*, unsigned long*, bool, bool):
14, include "mdspan_acc.h"
321, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::datastruct(double*, unsigned long, bool, unsigned long, unsigned long, unsigned long*, unsigned long*, bool, bool):
14, include "mdspan_acc.h"
351, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::operator()(unsigned long, unsigned long):
14, include "mdspan_acc.h"
228, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::operator()(unsigned long):
14, include "mdspan_acc.h"
213, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::subspanmatrix(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long*, unsigned long*, double*):
14, include "mdspan_acc.h"
506, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::transpose(unsigned long*, unsigned long*):
14, include "mdspan_acc.h"
267, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::column(unsigned long, unsigned long*):
14, include "mdspan_acc.h"
587, Generating acc routine seq
Generating NVIDIA GPU code
mdspan<double, std::vector<unsigned long, std::allocator>>::subspan(std::vector<unsigned long, std::allocator> const&, std::vector<unsigned long, std::allocator> const&, double*) const:
14, include "mdspan_acc.h"
mdspan<double, std::vector<unsigned long, std::allocator>>::subspanmatrix(unsigned long, unsigned long, unsigned long, unsigned long, double*) const:
14, include "mdspan_acc.h"
__gnu_cxx::__promote<decltype((__gnu_cxx::__promote<unsigned long, std::__is_integer::__value>::__type)(0) + (__gnu_cxx::__promote<double, std::__is_integer::__value>::__type)(0)), std::__is_integer<decltype((__gnu_cxx::__promote<unsigned long, std::__is_integer::__value>::__type)(0) + (__gnu_cxx::__promote<double, std::__is_integer::__value>::__type)(0))>::__value>::__type std::pow<unsigned long, double>(unsigned long, double):
14, include "mdspan_acc.h"
NVC++/x86-64 Linux 25.1-0: compilation completed with severe errors
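If I read the S-1065 message at line 1988 correctly, the parallel loop construct itself is the problem: inside an acc routine, OpenACC does not allow a new compute construct, only (orphaned) loop directives. So the gang-level form of that zeroing loop would presumably have to look roughly like this (an untested sketch; gpu_zero is just a hypothetical name, and whether nvc++ then accepts it is another question):

// Schematic: an orphaned gang loop inside a gang routine instead of a nested parallel construct.
#pragma acc routine gang
template<typename T>
void gpu_zero(datastruct<T>& R)
{
    #pragma acc loop gang
    for (size_t i = 0; i < R.pdatalength; i++)
        R.pdata[i] = 0;
}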
Hm, would using pure CUDA hopefully be a bit more stable?
Maybe I can circumvent all these problems with the mapper and those loops by calling CUDA directly?
(But then I am no longer device independent…)
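For comparison, the strided dot product itself is easy to express as a plain CUDA kernel. A minimal sketch of what that fallback could look like (kernel name and the single-block shared-memory reduction are my own choices, not code from my library):

#include <cuda_runtime.h>

// One block, power-of-two blockDim.x up to 256, shared-memory tree reduction;
// enough for a single strided dot product on device pointers a and b.
__global__ void dot_kernel(const double* a, size_t as,
                           const double* b, size_t bs,
                           size_t n, double* result)
{
    __shared__ double partial[256];
    double sum = 0.0;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        sum += a[i * as] * b[i * bs];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *result = partial[0];
}

// Launched e.g. as dot_kernel<<<1, 256>>>(a, as, b, bs, n, result);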