Unresolved symbols for OpenACC acc_malloc in my program with nvc++

Hi, I am currently writing a C++ library that uses OpenACC.
I have the latest NVIDIA HPC SDK 25.1 installed. Yet when I compile my program with

nvc++ -std=c++20 -acc=gpu -Minfo=accel ./main_acc.cpp -lrt -lm -lc -lstdc++

I get the following results:
Generating NVIDIA GPU code
nvlink error : Undefined reference to 'acc_free' in '/tmp/nvc++XyLftM2jz0sj.o'
nvlink error : Undefined reference to 'acc_malloc' in '/tmp/nvc++XyLftM2jz0sj.o'

pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/bin/tools/nvdd

I have the graphics driver shown below. What must I link against to make the OpenACC functions available?
Strangely, the compiler says it parallelizes the loops correctly. I can temporarily replace the acc_malloc calls with new inside a compute region, but that is not advisable, since from what I have read, only acc_malloc would put the temporary arrays I need in the correct memory…

What is strange is that I can call other functions, like acc_on_device and so on…

Thanks for your help…

nvidia-smi
Fri Jan 24 22:27:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:2D:00.0  On |                  N/A |
| 31%   40C    P8             10W / 125W  |     317MiB /  6144MiB  |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Here is my code, by the way: https://github.com/bschulz81/AcceleratedLinearAlgebra (a library with some linear algebra functions that works with OpenMP and OpenACC). In this GitHub project, the calls to new in the function gpu_cholesky_decomposition on lines 1639 and 1640 just need to be replaced by acc_malloc, the calls to delete on lines 1721 and 1723 by acc_free, and similarly for the calls to new and delete in the functions gpu_lu_decomposition and gpu_qr_decomposition…

For me, this is a bit inconvenient, since I switched to OpenACC because the device mapper had problems in gcc and clang. For clang, I got memory-related problems in OpenMP when offloading: https://github.com/llvm/llvm-project/issues/122666 ("clang 20.0.0: omp device number is 1 despite having 2 devices. This appears to confuse #pragma omp target enter data commands").

For gcc, I encountered compilation bugs, which I reported in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118590 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118518 (gcc 14.2.1 nvptx cross compiler complains about alias definitions in a struct with two constructors that are not aliases).

It was nice to get some code running on the GPU, but I hope somebody can help me finally get acc_malloc working.

Hi Schulz.benjamin,

In general, this error means that there’s no device version available of the routine for the device linker to use.

Specifically for "acc_malloc/acc_free", these can only be called from the host and can't be used in device code.

If you need to allocate memory from the device, you can use “malloc”, but we strongly advise avoiding device side allocation. Allocations get serialized which can have a negative performance impact and the default heap size is quite small so it’s easy to encounter heap overflows. The heap can be increased via the environment variable NV_ACC_CUDA_HEAPSIZE or by calling cudaDeviceSetLimit.

On a side note, I typically recommend avoiding the "new" operator on the device as well, given that it invokes constructors and may introduce exception-handling code, which can cause issues.

-Mat

Hi, thank you for your fast reply.

Indeed, it seems it can find the symbol acc_malloc outside of the device region.
However, if I do this:

if (gpu_offload==true)
{

    datastruct<T> dA=A.pdatastruct,dL=L.pdatastruct;
    T*buffer=(T*)acc_malloc(sizeof(T)*(size_t)2*dA.pdatalength);

#pragma acc enter data copyin(dA)
#pragma acc enter data copyin(dA.pdata[0:dA.pdatalength])
#pragma acc enter data copyin(dA.pextents[0:dA.prank])
#pragma acc enter data copyin(dA.pstrides[0:dA.prank])
#pragma acc enter data copyin(dL)
#pragma acc enter data create(dL.pdata[0:dL.pdatalength])
#pragma acc enter data copyin(dL.pextents[0:dL.prank])
#pragma acc enter data copyin(dL.pstrides[0:dL.prank])

#pragma acc enter data copyin(step_size)

#pragma acc kernels present(dA,dL,step_size) deviceptr(buffer)
do
{
gpu_cholesky_decomposition(dA,dL,buffer,step_size);
}
while(false);
#pragma acc update self(dL.pdata[0:dL.pdatalength])

#pragma acc exit data delete(dA.pdata[0:dA.pdatalength])
#pragma acc exit data delete(dA.pextents[0:dA.prank])
#pragma acc exit data delete(dA.pstrides[0:dA.prank])
#pragma acc exit data delete(dA)
#pragma acc exit data delete(dL.pdata[0:dL.pdatalength])
#pragma acc exit data delete(dL.pextents[0:dL.prank])
#pragma acc exit data delete(dL.pstrides[0:dL.prank])
#pragma acc exit data delete(dL)
#pragma acc exit data delete(step_size)
acc_free(buffer);
}

with the device routine being this:

#pragma acc routine worker
template<typename T>
inline void gpu_cholesky_decomposition(datastruct<T>& A, datastruct<T>& L, T* buffer=nullptr, size_t step_size=0)
{

const size_t n = A.pextents[0];
size_t z = 0; // Zero-based indexing, starts at the first column



if(step_size==0)
    step_size=(size_t)pow(n,0.8385);

const size_t tempsize = (n - step_size) * (n - step_size);

size_t pext3[2];
size_t pstrides3[2];

const size_t nn=n*n;

// Allocate memory for S on the device
T* sdata;
T* adata;

if (buffer==(T*) nullptr)
   // sdata=(T*) acc_malloc(sizeof(T)*tempsize);
   {
    sdata=new T[tempsize];
    adata=new T[nn];
   }
else
{
    sdata=buffer;
    adata=buffer+tempsize;
}

#pragma acc loop vector
for (size_t i=0; i<nn; i++)
{
adata[i]=A.pdata[i];
L.pdata[i]=0;
}

further code…

then it would fail with:

[localhost:10075:0:10075] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f40824fbc08)
==== backtrace (tid: 10075) ====
0 0x0000000000041820 __sigaction() ???:0
1 0x000000000041d77d gpu_cholesky_decomposition() /home/benni/projects/arraylibrary/mdspan_acc.h:1752
2 0x0000000000408eed cholesky_decomposition<double, std::vector<unsigned long, std::allocator > >() /home/benni/projects/arraylibrary/mdspan_acc.h:3076
3 0x0000000000404206 main() /home/benni/projects/arraylibrary/main_acc.cpp:84
4 0x00000000000265ce __libc_init_first() ???:0
5 0x0000000000026689 __libc_start_main() ???:0
6 0x0000000000403465 _start() ???:0
=================================

It fails at this line: adata[i]=A.pdata[i];, where adata is a part of the buffer.
I suppose that when I allocate an array as a temporary buffer, I should also be able to write to it.
But apparently I am not. I think I have allocated a suitable amount of memory. From my checks, the pointer buffer is also not a null pointer. Yet the device seems to have problems writing to it.
The entire source code is here: https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main/development
(but you would have to replace the above lines in the functions cholesky_decomposition and gpu_cholesky_decomposition).

What is also strange: it appears I cannot put these mapping pragmas into another function (it complains about string names), which is understandable, since pragmas are handled at compile time. But if I replace them with the acc_copyin and acc_create functions, I get memory errors. Also strange is this loop:

#pragma acc kernels present(dA,dL,step_size) deviceptr(buffer)
do
{
gpu_cholesky_decomposition(dA,dL,buffer,step_size);
}
while(false);
#pragma acc update self(dL.pdata[0:dL.pdatalength])

If I remove it, the acc kernels directive actually executes the Cholesky decomposition twice. If I replace the kernels directive and the while loop with acc serial, I also get memory problems with the variables…

All I want is to execute the function gpu_cholesky_decomposition on the device.

The reason for this is obvious. Unlike in a video game, or in deep learning, where one has a trained matrix M that one distributes to customers who type in vectors x and want the result M*x, in mathematics one usually has, e.g., simulations that are hundreds of lines long and use many complex algorithms, which are best separated into functions. My original goal is of course a function minimizer and then a differential equation solver; these need functions like the Cholesky decomposition. Such complex functions may also need temporary data for their calculations. So instead of running a few loops, I want to see whether I can run an entire function (which of course has many loops) on the device, one that needs large temporary memory…

At the moment, only this here is working:

datastruct dA=A.pdatastruct,dL=L.pdatastruct;
#pragma acc enter data copyin(dA)
#pragma acc enter data copyin(dA.pdata[0:dA.pdatalength])
#pragma acc enter data copyin(dA.pextents[0:dA.prank])
#pragma acc enter data copyin(dA.pstrides[0:dA.prank])
#pragma acc enter data copyin(dL)
#pragma acc enter data create(dL.pdata[0:dL.pdatalength])
#pragma acc enter data copyin(dL.pextents[0:dL.prank])
#pragma acc enter data copyin(dL.pstrides[0:dL.prank])

#pragma acc enter data copyin(step_size)

#pragma acc kernels present(dA,dL,step_size)
do
{
gpu_cholesky_decomposition(dA,dL,nullptr,step_size);
}
while(false);
#pragma acc update self(dL.pdata[0:dL.pdatalength])

#pragma acc exit data delete(dA.pdata[0:dA.pdatalength])
#pragma acc exit data delete(dA.pextents[0:dA.prank])
#pragma acc exit data delete(dA.pstrides[0:dA.prank])
#pragma acc exit data delete(dA)
#pragma acc exit data delete(dL.pdata[0:dL.pdatalength])
#pragma acc exit data delete(dL.pextents[0:dL.prank])
#pragma acc exit data delete(dL.pstrides[0:dL.prank])
#pragma acc exit data delete(dL)
#pragma acc exit data delete(step_size)

But then I have to call new on the device for the temporary data, which, as you say, is not ideal. Also, I still have this strange while loop, and I need those ugly mapping pragmas (which would at best be placed into function calls that then invoke acc_copyin). But unfortunately, these obvious solutions did not work.

I do not know why this is so.

In gcc, I noticed problems with the device mapper. I hope this is not the case in nvc++. So if somebody can help me a bit, I would be grateful.

What is the output from the compiler feedback messages? (i.e., add the flag "-Minfo=accel")

A seg fault is on the host so what I suspect is happening is that your kernels region isn’t getting offloaded and instead the run is falling back to the host version. Hence when the device address is accessed on the host, the segv occurs.

"do" loops are not parallelizable, given that the termination condition creates a dependency. So this is why it's likely that the code is running on the host. Though the compiler feedback messages will confirm this.

Parallel loops must be countable (i.e. for loops) where the number of loop iterations is known at the time of the launch.

Hi. It is true that do loops cannot be parallelized.

However, with the approach above, the test acc_on_device(acc_device_nvidia) returns true inside the function gpu_cholesky_decomposition. So gpu_cholesky_decomposition does seem to run on the device.

All that I want to know is how I can just call the function

gpu_cholesky_decomposition(dA,dL,nullptr,step_size);

on the device from the host.

The function itself of course has many loops that are parallelizable, and it is declared as an acc routine worker…
/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/bin/nvc++ -I/home/benni/projects/arraylibrary/arraytest -I/home/benni/projects/arraylibrary -I/opt/cuda/include -I/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/etc/include_acc -std=c++20 -O1 -acc=gpu -mp=multicore -Minfo=accel -std=c++23 -std=c++20 -acc -gpu=cuda12.6 -Minfo=accel -MD -MT CMakeFiles/arraytest.dir/main_acc.cpp.o -MF CMakeFiles/arraytest.dir/main_acc.cpp.o.d -o CMakeFiles/arraytest.dir/main_acc.cpp.o -c /home/benni/projects/arraylibrary/main_acc.cpp
compute_offset(unsigned long const*, unsigned long*, unsigned long, bool):
14, include "mdspan_acc.h"
131, Generating NVIDIA GPU code
138, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:offset)
140, Vector barrier inserted for vector loop reduction
147, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:offset)
149, Vector barrier inserted for vector loop reduction
138, Loop is parallelizable
147, Loop is parallelizable
compute_offset(unsigned long, unsigned long, unsigned long, unsigned long):
14, include "mdspan_acc.h"
164, Generating acc routine seq
Generating NVIDIA GPU code
compute_data_length(unsigned long const*, unsigned long const*, unsigned long):
14, include "mdspan_acc.h"
199, Generating acc routine seq
Generating NVIDIA GPU code
fill_strides(unsigned long const*, unsigned long*, unsigned long, bool):
14, include "mdspan_acc.h"
276, Generating acc routine seq
Generating NVIDIA GPU code
bool matrix_multiply_dot<double, std::vector<unsigned long, std::allocator>, std::array<unsigned long, 2ul>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::array<unsigned long, 2ul>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, bool):
14, include "mdspan_acc.h"
3584, Generating enter data copyin(dB.pstrides[:dB.prank],dC)
Generating enter data create(dC.pdata[:dC.pdatalength])
Generating enter data copyin(dC.pextents[:dC.prank],dA.pstrides[:dA.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],cols,dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],rows,inner_dim,dC.pstrides[:dC.prank])
Generating present(cols,dA,dB,dC,rows,inner_dim)
Generating NVIDIA GPU code
3604, #pragma acc loop gang collapse(2) /* blockIdx.x */
3606, /* blockIdx.x collapsed */
3610, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:sum)
3610, Loop is parallelizable
3631, Generating update self(dC.pdata[:dC.pdatalength])
Generating exit data delete(dC,dC.pdata[:dC.pdatalength],dC.pextents[:dC.prank],dB.pstrides[:dB.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],dA.pstrides[:dA.prank],dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],cols,rows,inner_dim,dC.pstrides[:dC.prank])
void cholesky_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters, unsigned long, bool):
14, include "mdspan_acc.h"
3074, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dL)
Generating enter data create(dL.pdata[:dL.pdatalength])
Generating enter data copyin(dL.pextents[:dL.prank],step_size,dL.pstrides[:dL.prank])
Generating present(dA,step_size,dL)
Generating NVIDIA GPU code
3090, Generating update self(dL.pdata[:dL.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dL,dL.pdata[:dL.pdatalength],dL.pextents[:dL.prank],dA.pstrides[:dA.prank],step_size,dL.pstrides[:dL.prank])
void lu_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters&, unsigned long, bool):
14, include "mdspan_acc.h"
3229, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dL)
Generating enter data create(dL.pdata[:dL.pdatalength])
Generating enter data copyin(dL.pextents[:dL.prank],dL.pstrides[:dL.prank],dU)
Generating enter data create(dU.pdata[:dU.pdatalength])
Generating enter data copyin(dU.pextents[:dU.prank],dU.pstrides[:dU.prank],step_size)
Generating present(dA,dL,dU,step_size)
Generating NVIDIA GPU code
3251, Generating update self(dU.pdata[:dU.pdatalength],dL.pdata[:dL.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dL,dL.pdata[:dL.pdatalength],dL.pextents[:dL.prank],dA.pstrides[:dA.prank],dU,dU.pdata[:dU.pdatalength],dU.pextents[:dU.prank],dL.pstrides[:dL.prank],step_size,dU.pstrides[:dU.prank])
void qr_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters, unsigned long, bool):
14, include "mdspan_acc.h"
3380, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dQ)
Generating enter data create(dQ.pdata[:dQ.pdatalength])
Generating enter data copyin(dQ.pextents[:dQ.prank],dQ.pstrides[:dQ.prank],dR)
Generating enter data create(dR.pdata[:dR.pdatalength])
Generating enter data copyin(dR.pextents[:dR.prank],dR.pstrides[:dR.prank],step_size)
Generating present(dA,dQ,dR,step_size)
Generating NVIDIA GPU code
3402, Generating update self(dR.pdata[:dR.pdatalength],dQ.pdata[:dQ.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dQ,dQ.pdata[:dQ.pdatalength],dQ.pextents[:dQ.prank],dA.pstrides[:dA.prank],dR,dR.pdata[:dR.pdatalength],dR.pextents[:dR.prank],dQ.pstrides[:dQ.prank],step_size,dR.pstrides[:dR.prank])
bool strassen_multiply<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters const&):
14, include "mdspan_acc.h"
bool winograd_multiply<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters const&):
14, include "mdspan_acc.h"
void gpu_cholesky_decomposition(datastruct&, datastruct&, double*, unsigned long):
14, include "mdspan_acc.h"
1716, Generating NVIDIA GPU code
1750, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1758, #pragma acc loop seq
1777, #pragma acc loop worker, vector collapse(2) /* threadIdx.y threadIdx.x */
1779, /* threadIdx.y threadIdx.x collapsed */
1791, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
Generating reduction(+:temp)
1802, #pragma acc loop worker /* threadIdx.y */
1806, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:temp2)
1808, Vector barrier inserted for vector loop reduction
1750, Loop is parallelizable
1777, Loop is parallelizable
1779, Loop is parallelizable
1791, Loop is parallelizable
1802, Loop is parallelizable
1806, Loop is parallelizable
void gpu_matrix_multiply_dot_w(datastruct&, datastruct&, datastruct&):
14, include "mdspan_acc.h"
1652, Generating NVIDIA GPU code
1655, #pragma acc loop worker collapse(2) /* threadIdx.y */
1657, /* threadIdx.y collapsed */
1661, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:sum)
1663, Vector barrier inserted for vector loop reduction
1655, Loop is parallelizable
1657, Loop is parallelizable
1661, Loop is parallelizable
bool matrix_multiply_dot<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, bool):
14, include "mdspan_acc.h"
3584, Generating enter data copyin(dB.pstrides[:dB.prank],dC)
Generating enter data create(dC.pdata[:dC.pdatalength])
Generating enter data copyin(dC.pextents[:dC.prank],dA.pstrides[:dA.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],cols,dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],rows,inner_dim,dC.pstrides[:dC.prank])
Generating present(cols,dA,dB,dC,rows,inner_dim)
Generating NVIDIA GPU code
3604, #pragma acc loop gang collapse(2) /* blockIdx.x */
3606, /* blockIdx.x collapsed */
3610, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:sum)
3610, Loop is parallelizable
3631, Generating update self(dC.pdata[:dC.pdatalength])
Generating exit data delete(dC,dC.pdata[:dC.pdatalength],dC.pextents[:dC.prank],dB.pstrides[:dB.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],dA.pstrides[:dA.prank],dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],cols,rows,inner_dim,dC.pstrides[:dC.prank])
void gpu_lu_decomposition(datastruct&, datastruct&, datastruct&, double*, unsigned long):
14, include "mdspan_acc.h"
1829, Generating NVIDIA GPU code
1859, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1867, #pragma acc loop seq
1885, #pragma acc loop worker, vector collapse(2) /* threadIdx.y threadIdx.x */
1887, /* threadIdx.y threadIdx.x collapsed */
1896, #pragma acc loop worker /* threadIdx.y */
1900, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:temp)
1902, Vector barrier inserted for vector loop reduction
1908, #pragma acc loop worker /* threadIdx.y */
1912, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:temp)
1914, Vector barrier inserted for vector loop reduction
1859, Loop is parallelizable
1885, Loop is parallelizable
1887, Loop is parallelizable
1896, Loop is parallelizable
1900, Loop is parallelizable
1908, Loop is parallelizable
1912, Loop is parallelizable
void gpu_qr_decomposition(datastruct&, datastruct, datastruct&, double*, unsigned long):
14, include "mdspan_acc.h"
1934, Generating NVIDIA GPU code
1970, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1975, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1981, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1991, #pragma acc loop seq
2025, #pragma acc loop worker, vector collapse(2) /* threadIdx.y threadIdx.x */
2027, /* threadIdx.y threadIdx.x collapsed */
2039, #pragma acc loop worker /* threadIdx.y */
2047, #pragma acc loop vector /* threadIdx.x */
2056, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
2063, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1970, Loop is parallelizable
1975, Loop is parallelizable
1981, Loop is parallelizable
2025, Loop is parallelizable
2027, Loop is parallelizable
2039, Loop is parallelizable
2047, Loop is parallelizable
2056, Loop is parallelizable
2063, Loop is parallelizable
double gpu_dot_product_s(datastruct, datastruct):
14, include "mdspan_acc.h"
2369, Generating acc routine seq
Generating NVIDIA GPU code
double dot_product<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&):
14, include "mdspan_acc.h"
bool matrix_add<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&):
14, include "mdspan_acc.h"
bool matrix_subtract<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&):
14, include "mdspan_acc.h"
datastruct::datastruct(double*, unsigned long, bool, unsigned long, unsigned long*, unsigned long*, bool, bool):
14, include "mdspan_acc.h"
316, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::datastruct(double*, unsigned long, bool, unsigned long, unsigned long, unsigned long*, unsigned long*, bool, bool):
14, include "mdspan_acc.h"
346, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::operator()(unsigned long, unsigned long):
14, include "mdspan_acc.h"
218, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::operator()(unsigned long):
14, include "mdspan_acc.h"
211, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::subspanmatrix(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long*, unsigned long*, double*):
14, include "mdspan_acc.h"
501, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::transpose(unsigned long*, unsigned long*):
14, include "mdspan_acc.h"
262, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::column(unsigned long, unsigned long*):
14, include "mdspan_acc.h"
580, Generating acc routine seq
Generating NVIDIA GPU code
mdspan<double, std::vector<unsigned long, std::allocator>>::subspan(std::vector<unsigned long, std::allocator> const&, std::vector<unsigned long, std::allocator> const&, double*) const:
14, include "mdspan_acc.h"
mdspan<double, std::vector<unsigned long, std::allocator>>::subspanmatrix(unsigned long, unsigned long, unsigned long, unsigned long, double*) const:
14, include "mdspan_acc.h"
__gnu_cxx::__promote<decltype((__gnu_cxx::__promote<unsigned long, std::__is_integer::__value>::__type)(0) + (__gnu_cxx::__promote<double, std::__is_integer::__value>::__type)(0)), std::__is_integer<decltype((__gnu_cxx::__promote<unsigned long, std::__is_integer::__value>::__type)(0) + (__gnu_cxx::__promote<double, std::__is_integer::__value>::__type)(0))>::__value>::__type std::pow<unsigned long, double>(unsigned long, double):
14, include "mdspan_acc.h"
[100%] Linking CXX executable arraytest

Actually, when I call it without that loop, acc kernels executes gpu_cholesky_decomposition twice.

This happens if I insert

if (acc_on_device(acc_device_nvidia)==true)
    printf("on_device");

into gpu_cholesky_decomposition and supply it with a nullptr as buffer, so that it allocates its temporary buffers on the device, i.e. with this code:

    datastruct<T> dA=A.pdatastruct,dL=L.pdatastruct;
    T*buffer=(T*)acc_malloc(sizeof(T)*((size_t)2*dA.pdatalength+dL.pdatalength));

#pragma acc enter data copyin(dA)
#pragma acc enter data copyin(dA.pdata[0:dA.pdatalength])
#pragma acc enter data copyin(dA.pextents[0:dA.prank])
#pragma acc enter data copyin(dA.pstrides[0:dA.prank])
#pragma acc enter data copyin(dL)
#pragma acc enter data create(dL.pdata[0:dL.pdatalength])
#pragma acc enter data copyin(dL.pextents[0:dL.prank])
#pragma acc enter data copyin(dL.pstrides[0:dL.prank])

#pragma acc enter data copyin(step_size)

#pragma acc kernels present(dA,dL,step_size)
do
{
gpu_cholesky_decomposition(dA,dL,(T*) nullptr,step_size);
}
while(false);
#pragma acc update self(dL.pdata[0:dL.pdatalength])

A Cholesky decomposition with the multiplication on gpu
4 12 -16
12 37 -43
-16 -43 98

2 0 0
6 1 0
-8 5 3

Now the cholesky decomposition is entirely done on gpu
on_device2 0 0
6 1 0
-8 5 3

So acc_on_device returns true, even when it is called inside the while loop. That is not the problem.

Strangely, I get problems when I remove that while loop…

Apparently, nvc++ gets a bit confused when you want to upload a struct (datastruct) that has not only fields but also methods, and when you just want to make a single call to a function on the GPU (which then has many loops and needs many pointers to data arrays)…

This here:
#pragma acc enter data copyin(step_size) present(dA,dL,step_size) deviceptr(buffer)
{
gpu_cholesky_decomposition(dA,dL,buffer,step_size);
}
#pragma acc update self(dL.pdata[0:dL.pdatalength])

will fail to compile with "invalid text in pragma"…

It cannot be that device functions can only be called within loops…

I mean, large simulations consist of hundreds of complex functions…
I should be able to call a function, say a large PDE solver, on the device and upload all the necessary data.
This function should also be able to allocate the temporary arrays it needs during its large computations…
And I need to be able to pass this function large arrays as arguments…

With gcc, I had strange problems with the device mapper, and with function aliases:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118518
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118590

Perhaps it is something like this

Hm, I must correct myself. I do not know where I saw the double execution. But this here:

#pragma acc enter data copyin(dA)
#pragma acc enter data copyin(dA.pdata[0:dA.pdatalength])
#pragma acc enter data copyin(dA.pextents[0:dA.prank])
#pragma acc enter data copyin(dA.pstrides[0:dA.prank])
#pragma acc enter data copyin(dL)
#pragma acc enter data create(dL.pdata[0:dL.pdatalength])
#pragma acc enter data copyin(dL.pextents[0:dL.prank])
#pragma acc enter data copyin(dL.pstrides[0:dL.prank])

#pragma acc enter data copyin(step_size)
#pragma acc kernels present(dA,dL, step_size) deviceptr(buffer)
{
gpu_cholesky_decomposition(dA,dL,(T*) nullptr,step_size);
}

#pragma acc update self(dL.pdata[0:dL.pdatalength])

#pragma acc exit data delete(dA.pdata[0:dA.pdatalength])
#pragma acc exit data delete(dA.pextents[0:dA.prank])
#pragma acc exit data delete(dA.pstrides[0:dA.prank])
#pragma acc exit data delete(dA)
#pragma acc exit data delete(dL.pdata[0:dL.pdatalength])
#pragma acc exit data delete(dL.pextents[0:dL.prank])
#pragma acc exit data delete(dL.pstrides[0:dL.prank])
#pragma acc exit data delete(dL)
#pragma acc exit data delete(step_size)

seems also to work.

When I change this into

#pragma acc kernels present(dA,dL, step_size) deviceptr(buffer)
{
gpu_cholesky_decomposition(dA,dL,buffer,step_size);
}

it fails with the access violation described above… However, what is interesting: if I change this into

#pragma acc serial present(dA,dL, step_size) deviceptr(buffer)
{
gpu_cholesky_decomposition(dA,dL,buffer,step_size);
}

then I do not(!) get an access violation. However, it just prints nonsense:

A Cholesky decomposition with the multiplication on gpu
4 12 -16
12 37 -43
-16 -43 98

2 0 0
6 1 0
-8 5 3

Now the cholesky decomposition is entirely done on gpu
on_device2 1.4822e-323 6.95285e-310
0 0 3.62226e-316
0 -nan 0

See that the on_device message is there. It just appears that the fields pdata, pextents, pstrides of the two variables dA and dL of type datastruct are no longer correctly mapped. (And I do not know whether the loops in gpu_cholesky_decomposition are still executed in parallel, as they should be…)

To me, this indicates that there is a problem with the device mapper, as in gcc.

What I have is a class called mdspan, somewhat an extension of the new C++ mdspan, except that it uses C++ concepts and can hold strides and extents on the stack (for compile time) as well as on the heap (vectors). To facilitate GPU offload, my mdspan class contains a member variable of type datastruct, which has fields for the pointers to the data, the extents, the strides, and the lengths of these arrays (as well as methods for construction and subarrays). In contrast to a simple struct with just a few floats, these pointer members also reference memory. Maybe this somehow confuses the mapper, so that it sometimes does not upload my arrays.

I have encountered something similar in gcc. The code itself seems to be correct, and on the host it executes correctly. It is just that something is wrong with the data mapping. Either I can map dL and dA and their arrays, and then I get problems with the temporary buffer, or I can map the temporary buffer, and then the arrays of dA and dL are received as zero on the GPU side…

When I use the serial call, the input which I get on the device for dA and dL is this:

Now the cholesky decomposition is entirely done on gpu
on_device4.000000 12.000000 -16.000000
12.000000 37.000000 -43.000000
-16.000000 -43.000000 98.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
2 1.4822e-323 6.95294e-310
0 0 2.84949e-315
0 -nan 0

despite me mapping these arrays:

#pragma acc enter data copyin(dA)
#pragma acc enter data copyin(dA.pdata[0:dA.pdatalength])
#pragma acc enter data copyin(dA.pextents[0:dA.prank])
#pragma acc enter data copyin(dA.pstrides[0:dA.prank])
#pragma acc enter data copyin(dL)
#pragma acc enter data create(dL.pdata[0:dL.pdatalength])
#pragma acc enter data copyin(dL.pextents[0:dL.prank])
#pragma acc enter data copyin(dL.pstrides[0:dL.prank])

And yes, as you see, I take care of also copying the lookup tables (extents and strides) for dA and dL…

Wait, no: the second matrix L should initially be zero.
So the mapping does work. It is just that with acc serial, there is something wrong with the temporary buffer. If I supply a nullptr for the buffer, the function allocates it on the device with new.

so when I run:

#pragma acc kernels present(dA,dL, step_size) deviceptr(buffer)
{
gpu_cholesky_decomposition(dA,dL,(T*)nullptr,step_size);
}
then I get

Now the cholesky decomposition is entirely done on gpu
on_device2 0 0
6 1 0
-8 5 3

as it should be. Just that I have now called new on the device (which I should not).

When I call:
#pragma acc kernels present(dA,dL, step_size) deviceptr(buffer)
{
gpu_cholesky_decomposition(dA,dL,buffer,step_size);
}
then I get:

Now the cholesky decomposition is entirely done on gpu
[localhost:24150:0:24150] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f6c4c4fbc08)
==== backtrace (tid: 24150) ====

and when I call:
#pragma acc serial present(dA,dL, step_size) deviceptr(buffer)
{
gpu_cholesky_decomposition(dA,dL,buffer,step_size);
}

2 1.4822e-323 6.95294e-310
0 0 2.84949e-315
0 -nan 0

In order to ensure that I did not make some fault with pointer arithmetic: the function actually needs two buffers, so I can also separate them. Once I acc_malloc this and try to supply it via deviceptr, I get errors. The data for this test was not large, just a 3x3 matrix, which any GPU should be able to handle…

I had a bit more time this afternoon so downloaded your code to take a closer look.

I applied the above changes, though used two buffers instead of one, and was able to recreate the segv.

What appears to be happening is that at higher optimization the compiler sees that the do/while loop is always false, so it removes it. Secondly, “gpu_cholesky_decomposition” has the “inline” keyword, so it gets auto-inlined.

With the “kernels” construct, the compiler has more freedom in how to parallelize the section. Though it’s not offloading the first part of the routine, and hence the device pointer gets dereferenced on the host, causing the segv.

If you change “kernels” to “parallel” or use a single iteration for loop, then it will work around the error as this will force the full routine to get offloaded.

#pragma acc parallel present(dA,dL,step_size) deviceptr(buffer1,buffer2)
        do
//for (int ii=0; ii<1;++ii)
        {
//            gpu_cholesky_decomposition(dA,dL,(T*)nullptr,(T*)nullptr,step_size);
            gpu_cholesky_decomposition(dA,dL,buffer1,buffer2,step_size);
        }
        while(false);
#pragma acc update self(dL.pdata[0:dL.pdatalength])

I get a second error later, in “matrix_multiplication_parameters”, this time an illegal address error, which is basically a device segv, but I have run out of time for the evening so didn’t investigate. There are other issues, like worker routines calling other worker routines, which is technically illegal OpenACC, so that should get cleaned up.

Though before we go too far, I’m wondering if you’ve looked at cuSOLVER to see if it can be used? These types of algorithms are very difficult to get performing well on a GPU, so if there’s a way to use cuSOLVER which is highly optimized, it may be beneficial.

-Mat

Hi, thank you very much.

Indeed it appears that if I write:

#pragma acc parallel present(dA,dQ,dR, step_size) deviceptr(buffer)
{
gpu_qr_decomposition(dA,dQ,dR,buffer,step_size);
}

then it accepts that pointer, and now what I have on my hard drive does not crash anymore.

Unless, however, I use -O2 or -O3. Interestingly, even if I do not write a loop…

And note: I get this ONLY for the QR decomposition, not for the Cholesky or LU decomposition, which is kind of strange.

Entirely on gpu
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: /home/benni/projects/arraylibrary/mdspan_acc.h
Function: _Z16qr_decompositionIdSt6vectorImSaImEEEvR6mdspanIT_T0_ES7_S7_32matrix_multiplication_parametersmb:3361
Line: 3387

It may have something to do with the fact that I declared matrix multiply as a worker routine.
And indeed it should run a parallel loop, but it is called from gpu_qr_decomposition, which needs it three times and also runs parallel loops by itself.

But one question:
Why is it illegal for worker routines to call worker routines?

That would make it rather impossible to write complex functions on the GPU which contain parallel loops but also, outside of these loops, call functions that have parallel loops themselves.

As to why I do not want to use cuSOLVER:

Well,
1) I want my program to run on many platforms.
2) I am a physicist and currently need a function minimizer for an astronomical problem. Using CUDA would mean exactly this: upload something, compute something, download it, upload again… The point is, I do not need just a single eigenvalue computation or such a solver once; I am writing a routine that needs to call many such parallel routines. I also need to write routines that are much more involved than what is already in CUDA. (So, unless you put in a relativity theory package with algorithms for finding solutions of Einstein’s equations on the GPU, I guess I have to write it myself.)

But thanks for your reply… At least acc_malloc seems to work now.

Something interesting: in the qr decomposition, there is a call:
T norm = sqrt(gpu_dot_product_s(v,v));

inside of a worker loop.

The QR decomposition is a worker routine; the dot product is designated as seq. But, well, let’s change it to v (vector).
The QR decomposition is a worker, so it can call vector routines.
Just that then one gets a segfault, even when using -O1.

I guess we need a new OpenACC standard,

  1. where I can call device routines that are not in a loop without using hacks or pseudo-loops,
  2. where routines on the device can allocate large temporary memory,
  3. where functions designated as worker can call other worker routines (simply because a function may contain some loops, and also, at places where there is no loop, may need a parallel matrix multiplication for intermediate results or something like that),
  4. and where that segfault problem is fixed.

By the way, there were some technical errors in my code; it was written rather in a hurry. I have uploaded a version that now even compiles with clang.

Unfortunately, clang will just output what we had before with that buffer…

Maybe I will switch back to OpenMP on the device.
But OpenMP had similar problems when mapping to the device and calling device routines. :-(

Hi, well, I have researched this dot product and the compiler a bit more.

I do not know whether nvc++ has a bug reporting form, like Bugzilla.
But what I found may be of interest for those who can improve nvc++.

So I am posting it here. It would be great if either these errors turn out to be on my side, or, if they are confirmed issues, they lead to actual fixes in the compiler…

This function here:

    T norm = sqrt(gpu_dot_product_s(v,v));

is called in my code in a loop, but one which is not parallelizable and thus has no #pragma acc; the dot product is in line 1955 of file

As I said, if I switch this to the vectorized version, then I get a segfault when compiling with -O1 or without optimization…

One may think that this is because of the strides variable that the dot product uses via datastruct’s operator()(size_t row). Indeed, for vectorization this additional strides field is a bit problematic. Normally, one would declare it constant after construction. But this is a struct that can be offloaded and updated by a GPU: if I declare these entities constant before offloading, I get mapping errors, which are well deserved in that case. In the offloaded functions, however, I can declare the arguments as constant; apparently it does not matter for nvc++…

Now, let’s circumvent the strides issue and the potential operator, and write two functions:

#pragma acc routine seq
template<typename T>
inline T gpu_dot_product_s(const datastruct<T> vec1, const datastruct<T> vec2)
{
    const size_t n = vec1.pextents[0];
    T result = 0;
    const size_t v1s = vec1.pstrides[0];
    const size_t v2s = vec2.pstrides[0];
    //#pragma acc loop reduction(+:result)
    for (size_t i = 0; i < n; ++i)
    {
        result += vec1.pdata[i * v1s] * vec2.pdata[i * v2s];
    }
    return result;
}

#pragma acc routine vector
template<typename T>
inline T gpu_dot_product_v(const datastruct<T> vec1, const datastruct<T> vec2)
{
    const size_t n = vec1.pextents[0];
    T result = 0;
    const size_t v1s = vec1.pstrides[0];
    const size_t v2s = vec2.pstrides[0];
    //#pragma acc loop vector reduction(+:result)
    for (size_t i = 0; i < n; ++i)
    {
        result += vec1.pdata[i * v1s] * vec2.pdata[i * v2s];
    }
    return result;
}

Despite containing the same code, the function designated as vector will fail if used in line 1955.

One can promote the function gpu_qr_decomposition to a gang version; a worker dot product would fail too, as well as a vector dot product.

It does not want to accept anything but seq here. Interestingly, this is not so for the matrix multiplications, which are also called: they can even be workers at -O1, and they work even if gpu_qr_decomposition is itself a gang or a worker routine.

With the sequential dot product, a problem appears only at -O2.

If I promote gpu_qr_decomposition into a gang function and use -O2, then the worker matrix multiplications become legal. However, then the following compile failure happens:

     537, Complex loop carried dependence of .inl__T2662_19983_16827->pdata->,-> prevents parallelization
          Parallelization requires privatization of -> as well as last value

NVC++-W-1052-Reductions on a gang loop in acc routine are not supported (/home/benni/projects/arraylibrary/main_acc.cpp: 328)
NVC++-W-1052-Reductions on a gang loop in acc routine are not supported (/home/benni/projects/arraylibrary/main_acc.cpp: 274)
NVC++-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Load of NULL symbol (/home/benni/projects/arraylibrary/main_acc.cpp: 1944)
void gpu_qr_decomposition(datastruct&, datastruct, datastruct&, double*, unsigned long):

The problem with this statement is that, well, there is a constructor in line 274 which has a loop.

But, well, I wrote:
#pragma acc routine seq
template<typename T> datastruct<T>::datastruct(

So with -O2, the compiler basically designates the programmer as stupid and ignores it when I declare a function explicitly as sequential…

This is unfortunate: if the compiler ignores pragmas which say it should not parallelize a loop, parallelizes it nevertheless at -O2, and then fails, I cannot use -O2 and benefit from its other optimizations…

So let’s stick with -O1 for now.

Also interesting: if one leaves the designation out of gpu_qr_decomposition and just declares it a routine, nvc++ will fail instead of ignoring the loop pragmas. This is OK.

But if I promote gpu_qr_decomposition into a gang routine and use -O1, one might assume that I can then issue statements like this:

#pragma acc parallel loop
for (size_t i=0; i<R.pdatalength; i++)
{
R.pdata[i]=0;
}

instead of the #pragma acc loop vector. But no.

Note that if I write:
#pragma acc routine gang
template<typename T>
void gpu_qr_decomposition

compilation of this:

#pragma acc parallel loop
for (size_t i=0; i<R.pdatalength; i++)
{
R.pdata[i]=0;
}

ends with the following compilation abort:

e -Minfo=accel -std=c++23 -std=c++20 -acc -gpu=cuda12.6 -Minfo=accel -MD -MT CMakeFiles/arraytest.dir/main_acc.cpp.o -MF CMakeFiles/arraytest.dir/main_acc.cpp.o.d -o CMakeFiles/arraytest.dir/main_acc.cpp.o -c /home/benni/projects/arraylibrary/main_acc.cpp
compute_offset(unsigned long const*, unsigned long*, unsigned long, bool):
14, include “mdspan_acc.h”
133, Generating NVIDIA GPU code
140, #pragma acc loop vector /* threadIdx.x */
     Generating reduction(+:offset)
142, Vector barrier inserted for vector loop reduction
149, #pragma acc loop vector /* threadIdx.x */
     Generating reduction(+:offset)
151, Vector barrier inserted for vector loop reduction
140, Loop is parallelizable
149, Loop is parallelizable
compute_offset(unsigned long, unsigned long, unsigned long, unsigned long):
14, include “mdspan_acc.h”
166, Generating acc routine seq
Generating NVIDIA GPU code
compute_data_length(unsigned long const*, unsigned long const*, unsigned long):
14, include “mdspan_acc.h”
201, Generating acc routine seq
Generating NVIDIA GPU code
fill_strides(unsigned long const*, unsigned long*, unsigned long, bool):
14, include “mdspan_acc.h”
281, Generating acc routine seq
Generating NVIDIA GPU code
bool matrix_multiply_dot<double, std::vector<unsigned long, std::allocator>, std::array<unsigned long, 2ul>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::array<unsigned long, 2ul>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, bool):
14, include “mdspan_acc.h”
3601, Generating enter data copyin(dB.pstrides[:dB.prank],dC)
Generating enter data create(dC.pdata[:dC.pdatalength])
Generating enter data copyin(dC.pextents[:dC.prank],dA.pstrides[:dA.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],cols,dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],rows,inner_dim,dC.pstrides[:dC.prank])
Generating present(cols,dA,dB,dC,rows,inner_dim)
Generating NVIDIA GPU code
3621, #pragma acc loop gang collapse(2) /* blockIdx.x */
3623,   /* blockIdx.x collapsed */
3627, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:sum)
3627, Loop is parallelizable
3648, Generating update self(dC.pdata[:dC.pdatalength])
Generating exit data delete(dC,dC.pdata[:dC.pdatalength],dC.pextents[:dC.prank],dB.pstrides[:dB.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],dA.pstrides[:dA.prank],dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],cols,rows,inner_dim,dC.pstrides[:dC.prank])
void cholesky_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters, unsigned long, bool):
14, include “mdspan_acc.h”
3089, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dL)
Generating enter data create(dL.pdata[:dL.pdatalength])
Generating enter data copyin(dL.pextents[:dL.prank],step_size,dL.pstrides[:dL.prank])
Generating present(dA,step_size,dL)
Generating NVIDIA GPU code
3104, Generating update self(dL.pdata[:dL.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dL,dL.pdata[:dL.pdatalength],dL.pextents[:dL.prank],dA.pstrides[:dA.prank],step_size,dL.pstrides[:dL.prank])
void lu_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters&, unsigned long, bool):
14, include “mdspan_acc.h”
3247, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dL)
Generating enter data create(dL.pdata[:dL.pdatalength])
Generating enter data copyin(dL.pextents[:dL.prank],dL.pstrides[:dL.prank],dU)
Generating enter data create(dU.pdata[:dU.pdatalength])
Generating enter data copyin(dU.pextents[:dU.prank],dU.pstrides[:dU.prank],step_size)
Generating present(dA,dL,dU,step_size)
Generating NVIDIA GPU code
3267, Generating update self(dU.pdata[:dU.pdatalength],dL.pdata[:dL.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dL,dL.pdata[:dL.pdatalength],dL.pextents[:dL.prank],dA.pstrides[:dA.prank],dU,dU.pdata[:dU.pdatalength],dU.pextents[:dU.prank],dL.pstrides[:dL.prank],step_size,dU.pstrides[:dU.prank])
void qr_decomposition<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters, unsigned long, bool):
14, include “mdspan_acc.h”
3399, Generating enter data copyin(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dA.pstrides[:dA.prank],dQ)
Generating enter data create(dQ.pdata[:dQ.pdatalength])
Generating enter data copyin(dQ.pextents[:dQ.prank],dQ.pstrides[:dQ.prank],dR)
Generating enter data create(dR.pdata[:dR.pdatalength])
Generating enter data copyin(dR.pextents[:dR.prank],dR.pstrides[:dR.prank],step_size)
Generating present(dA,dQ,dR,step_size)
Generating NVIDIA GPU code
CUDA shared memory used for _T438_19957
3418, Generating update self(dR.pdata[:dR.pdatalength],dQ.pdata[:dQ.pdatalength])
Generating exit data delete(dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],dQ,dQ.pdata[:dQ.pdatalength],dQ.pextents[:dQ.prank],dA.pstrides[:dA.prank],dR,dR.pdata[:dR.pdatalength],dR.pextents[:dR.prank],dQ.pstrides[:dQ.prank],step_size,dR.pstrides[:dR.prank])
bool strassen_multiply<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters const&):
14, include “mdspan_acc.h”
bool winograd_multiply<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, matrix_multiplication_parameters const&):
14, include “mdspan_acc.h”
void gpu_cholesky_decomposition(datastruct&, datastruct&, double*, unsigned long):
14, include “mdspan_acc.h”
1726, Generating NVIDIA GPU code
1759, #pragma acc loop vector /* threadIdx.x */
1768, #pragma acc loop seq
1787, #pragma acc loop worker, vector collapse(2) /* threadIdx.y threadIdx.x */
1789,   /* threadIdx.y threadIdx.x collapsed */
1801, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
      Generating reduction(+:temp)
1812, #pragma acc loop worker /* threadIdx.y */
1816, #pragma acc loop vector /* threadIdx.x */
      Generating reduction(+:temp2)
1818, Vector barrier inserted for vector loop reduction
1759, Loop is parallelizable
1787, Loop is parallelizable
1789, Loop is parallelizable
1801, Loop is parallelizable
1812, Loop is parallelizable
1816, Loop is parallelizable
void gpu_matrix_multiply_dot_w(datastruct&, datastruct&, datastruct&):
14, include “mdspan_acc.h”
1655, Generating NVIDIA GPU code
1661, #pragma acc loop worker collapse(2) /* threadIdx.y */
1663,   /* threadIdx.y collapsed */
1667, #pragma acc loop vector /* threadIdx.x */
Generating reduction(+:sum)
1669, Vector barrier inserted for vector loop reduction
1661, Loop is parallelizable
1663, Loop is parallelizable
1667, Loop is parallelizable
bool matrix_multiply_dot<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&, bool):
14, include “mdspan_acc.h”
3601, Generating enter data copyin(dB.pstrides[:dB.prank],dC)
Generating enter data create(dC.pdata[:dC.pdatalength])
Generating enter data copyin(dC.pextents[:dC.prank],dA.pstrides[:dA.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],cols,dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],rows,inner_dim,dC.pstrides[:dC.prank])
Generating present(cols,dA,dB,dC,rows,inner_dim)
Generating NVIDIA GPU code
3621, #pragma acc loop gang collapse(2) /* blockIdx.x */
3623,   /* blockIdx.x collapsed */
3627, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:sum)
3627, Loop is parallelizable
3648, Generating update self(dC.pdata[:dC.pdatalength])
Generating exit data delete(dC,dC.pdata[:dC.pdatalength],dC.pextents[:dC.prank],dB.pstrides[:dB.prank],dB,dB.pdata[:dB.pdatalength],dB.pextents[:dB.prank],dA.pstrides[:dA.prank],dA,dA.pdata[:dA.pdatalength],dA.pextents[:dA.prank],cols,rows,inner_dim,dC.pstrides[:dC.prank])
void gpu_lu_decomposition(datastruct&, datastruct&, datastruct&, double*, unsigned long):
14, include “mdspan_acc.h”
1839, Generating NVIDIA GPU code
1869, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
1877, #pragma acc loop seq
1895, #pragma acc loop worker, vector collapse(2) /* threadIdx.y threadIdx.x */
1897,   /* threadIdx.y threadIdx.x collapsed */
1906, #pragma acc loop worker /* threadIdx.y */
1910, #pragma acc loop vector /* threadIdx.x */
      Generating reduction(+:temp)
1912, Vector barrier inserted for vector loop reduction
1918, #pragma acc loop worker /* threadIdx.y */
1922, #pragma acc loop vector /* threadIdx.x */
      Generating reduction(+:temp)
1924, Vector barrier inserted for vector loop reduction
1869, Loop is parallelizable
1895, Loop is parallelizable
1897, Loop is parallelizable
1906, Loop is parallelizable
1910, Loop is parallelizable
1918, Loop is parallelizable
1922, Loop is parallelizable
NVC++-S-1065-Unsupported nested compute construct in compute construct or acc routine (/home/benni/projects/arraylibrary/main_acc.cpp: 1988)
double gpu_dot_product_s(datastruct, datastruct):
14, include “mdspan_acc.h”
2383, Generating acc routine seq
Generating NVIDIA GPU code
double dot_product<double, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&):
14, include “mdspan_acc.h”
bool matrix_add<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&):
14, include “mdspan_acc.h”
bool matrix_subtract<double, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>, std::vector<unsigned long, std::allocator>>(mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>> const&, mdspan<double, std::vector<unsigned long, std::allocator>>&):
14, include “mdspan_acc.h”
datastruct::datastruct(double*, unsigned long, bool, unsigned long, unsigned long*, unsigned long*, bool, bool):
14, include “mdspan_acc.h”
321, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::datastruct(double*, unsigned long, bool, unsigned long, unsigned long, unsigned long*, unsigned long*, bool, bool):
14, include “mdspan_acc.h”
351, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::operator()(unsigned long, unsigned long):
14, include “mdspan_acc.h”
228, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::operator()(unsigned long):
14, include “mdspan_acc.h”
213, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::subspanmatrix(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long*, unsigned long*, double*):
14, include “mdspan_acc.h”
506, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::transpose(unsigned long*, unsigned long*):
14, include “mdspan_acc.h”
267, Generating acc routine seq
Generating NVIDIA GPU code
datastruct::column(unsigned long, unsigned long*):
14, include “mdspan_acc.h”
587, Generating acc routine seq
Generating NVIDIA GPU code
mdspan<double, std::vector<unsigned long, std::allocator>>::subspan(std::vector<unsigned long, std::allocator> const&, std::vector<unsigned long, std::allocator> const&, double*) const:
14, include “mdspan_acc.h”
mdspan<double, std::vector<unsigned long, std::allocator>>::subspanmatrix(unsigned long, unsigned long, unsigned long, unsigned long, double*) const:
14, include “mdspan_acc.h”
__gnu_cxx::__promote<decltype((__gnu_cxx::__promote<unsigned long, std::__is_integer::__value>::__type)(0) + (__gnu_cxx::__promote<double, std::__is_integer::__value>::__type)(0)), std::__is_integer<decltype((__gnu_cxx::__promote<unsigned long, std::__is_integer::__value>::__type)(0) + (__gnu_cxx::__promote<double, std::__is_integer::__value>::__type)(0))>::__value>::__type std::pow<unsigned long, double>(unsigned long, double):
14, include “mdspan_acc.h”
NVC++/x86-64 Linux 25.1-0: compilation completed with severe errors

Hmm, would pure CUDA hopefully be a bit more stable?

Maybe I can circumvent all these problems with the mapper and those loops by calling CUDA directly?

(But then I am no longer device independent…)

And by the way, the algorithms which I tried to implement are here:

As you can see, they are a bit more complex than just three nested loops: they need several matrix multiplications inside a loop, dot products inside a loop, submatrix extractions, and so on. In a more complex program, such a function would e.g. be called several times by a minimizer which wants to compute stationary values of something.

And another thing: I would like these large pragma blocks:

//#pragma acc enter data copyin(dA)
//#pragma acc enter data copyin(dA.pdata[0:dA.pdatalength])
//#pragma acc enter data copyin(dA.pextents[0:dA.prank])
//#pragma acc enter data copyin(dA.pstrides[0:dA.prank])
//#pragma acc enter data copyin(dL)
//#pragma acc enter data create(dL.pdata[0:dL.pdatalength])
//#pragma acc enter data copyin(dL.pextents[0:dL.prank])
//#pragma acc enter data copyin(dL.pstrides[0:dL.prank])

to be put into a function. However, when i create a function like this:

template <typename T, typename Container>
void mdspan<T, Container>::create_in_object()
{
    if (this->pis_associated==false)
    {
        // Map the datastruct object itself
        pdatastruct_dev=(datastruct<T>*)acc_malloc(sizeof(datastruct<T>));
        acc_map_data(&pdatastruct, pdatastruct_dev, sizeof(datastruct<T>));

        // Allocate memory on the device for pdata, pextents, and pstrides
        pdata_dev    = (T*)     acc_malloc(pdatastruct.pdatalength * sizeof(T));
        pextents_dev = (size_t*)acc_malloc(pdatastruct.prank * sizeof(size_t));
        pstrides_dev = (size_t*)acc_malloc(pdatastruct.prank * sizeof(size_t));

        // Copy data to the device
        acc_memcpy_to_device(pdata_dev,    pdatastruct.pdata,    pdatastruct.pdatalength * sizeof(T));
        acc_memcpy_to_device(pextents_dev, pdatastruct.pextents, pdatastruct.prank * sizeof(size_t));
        acc_memcpy_to_device(pstrides_dev, pdatastruct.pstrides, pdatastruct.prank * sizeof(size_t));

        // Map host pointers to device pointers
        acc_map_data(pdatastruct.pdata,    pdata_dev,    pdatastruct.pdatalength * sizeof(T));
        acc_map_data(pdatastruct.pextents, pextents_dev, pdatastruct.prank * sizeof(size_t));
        acc_map_data(pdatastruct.pstrides, pstrides_dev, pdatastruct.prank * sizeof(size_t));
        pis_associated=true;
    }
}

and give gpu_cholesky_decomposition pointers to the mapped structs (dA = A.pdatastruct_dev, and similarly for dL), and change the acc statement where gpu_cholesky_decomposition is called into deviceptr(dA,dL,buffer), then I also get a device segfault, even after changing the signature and calls of gpu_cholesky_decomposition to pointers instead of references…

This is unfortunate, because, since it is a larger struct, I would really like a member function that does this mapping in a single call…

With clang, unfortunately, not even these pragmas succeed… I do not know why…

I can make this problem with the mapper more concrete.
In the function cholesky_decomposition, I can replace these statements:

#pragma acc enter data copyin(dA)
#pragma acc enter data copyin(dA.pdata[0:dA.pdatalength])
#pragma acc enter data copyin(dA.pextents[0:dA.prank])
#pragma acc enter data copyin(dA.pstrides[0:dA.prank])
#pragma acc enter data copyin(dL)
#pragma acc enter data create(dL.pdata[0:dL.pdatalength])
#pragma acc enter data copyin(dL.pextents[0:dL.prank])
#pragma acc enter data copyin(dL.pstrides[0:dL.prank])

with these:
A.create_in_object();
datastruct<T> dA=A.pdatastruct;
datastruct<T> dL=L.pdatastruct;

    acc_copyin(&dA,sizeof(datastruct<T>));
    acc_copyin(dA.pdata, dA.pdatalength * sizeof(T));
    acc_copyin(dA.pextents,dA.prank * sizeof(size_t));
    acc_copyin(dA.pstrides,dA.prank * sizeof(size_t));
    
    acc_copyin(&dL,sizeof(datastruct<T>));
    acc_create(dL.pdata, dL.pdatalength * sizeof(T));
    acc_copyin(dL.pextents,dL.prank * sizeof(size_t));
    acc_copyin(dL.pstrides,dL.prank * sizeof(size_t));

This will work if I add:
#pragma acc parallel present(dA,dA.pdata,dA.pextents,dA.pstrides,dL,dL.pdata,dL.pextents,dL.pstrides,step_size) deviceptr(buffer)

However, if I put all the copyin statements into a separate function, I get a device segfault…
And I also get a device segfault if I explicitly replace them with acc_malloc and acc_map_data and indicate, at the call where I want to start the device function, that I pass device pointers… I then still get a segfault… Only those pragmas directly in front of the call work and copy the struct… I don’t understand why the function calls and device pointers do not work…

By the way, you can freely play around with my code from GitHub if it helps to improve OpenACC compilers. In contrast to a simpler test case, it is a struct with methods and many fields, vector statements, subarray allocation, pointer arithmetic, and many methods which get uploaded, so it can be seen as a more advanced test case…

OpenACC has three dimensions of parallelism, gang, worker, and vector, while OpenMP has two, teams and parallel.

The parallel dimensions form a hierarchy where gangs can call workers or vectors, and workers can call vectors. All can call sequential routines. But each can’t call others of the same level (except “seq”) or of a higher level. Calling the same type would effectively split one parallel dimension into multiple, which isn’t supported by the model.

I still encourage you to take a look at using cuSOLVER. Like you, I prefer portability over performance, but this is one area where I’d use libraries. Granted, this was about 10 years ago, but when I did an LU decomposition, while it worked, the performance was very poor. These types of algorithms are not easily parallelized, except for the gemm operation, whereas the library versions are highly optimized and can dramatically improve performance. Also, since these operations are very common, other vendors may have similar libraries. I’m not sure about Intel, but AMD has hipSOLVER.

The rest of your post seems to be issues surrounding deep-copy, which I answered in your other post.

Hi, first: OpenMP also has SIMD instructions, so you have teams, parallel, and simd in OpenMP.

as for that segfault in the qr decomposition:

Well, one has a worker loop there, which crashes if it calls a vector function.
But exactly that call would be allowed by the OpenACC model.

Furthermore, these worker functions which are called from the worker function are inlined. So one basically has in the code then just a series of openacc worker loops which call vector functions. I see no reason why this should not work, but it does not, when optimizing larger than -O1.

Furthermore, apparently, even -O1 fails when I call some vector functions from a worker loop.
It wants to have the vector functions sequentialized… and that is strange.

As for the memory problems: thanks, with the attach clause they are solved now.

To show this again, in the QR decomposition:

The function gpu_qr_decomposition is annotated as worker.

There is a non-parallelizable loop in there, and within it is this call:

T norm = sqrt(gpu_dot_product_s(v,v));

This call is designated as seq, and it works with -O1. But if we change it to a vector function, the code crashes even with -O1.

A worker function should be able to call a vector function; therefore that SIGSEGV makes no sense to me.

(I have seen nvc++ ignoring my pragmas before, i.e. it tried to parallelize loops which I said it should not. Perhaps it is that?)

I have now done more debugging.

I have placed the code of that vectorized dot product directly into gpu_qr_decomposition, and I've added some printfs:

printvector(v);

const size_t o = v.pextents[0];
T result = 0;
const size_t v1s = v.pstrides[0];

printf("vstrides\n");
printf("%lu\n", v1s);
printf("vextents\n");
printf("%lu\n", o);

#pragma acc loop vector reduction(+:result)
for (size_t i = 0; i < o; ++i)
{
    result += v(i) * v(i);
}

printf("result from sqrt dot product as inside code, vector loop\n");
printf("%f\n", sqrt(result));

T norm = sqrt(gpu_dot_product_v(v, v));

printf("result from sqrt dot product as vector function in the worker loop\n");
printf("%f\n", norm);

this prints out the following:

Entirely on gpu
printed vector v
12.000000
6.000000
-4.000000
vstrides
3
vextents
3
result from sqrt dot product as inside code, vector loop
14.000000
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: /home/benni/projects/arraylibrary/mdspan_acc.h
Function: _Z16qr_decompositionIdSt6vectorImSaImEEEvR6mdspanIT_T0_ES7_S7_32matrix_multiplication_parametersmb:3382
Line: 3397

So from within gpu_qr_decomposition, I can write a vector loop for a dot product inline, and that works.

But I cannot compute this dot product within a function designated as vector, even though that function contains exactly the same code and the same vector loop.

The compiler output for that loop is, of course:
2091, Loop is parallelizable
and likewise for the loop in the vector dot product, whose code is this:

#pragma acc routine vector
template<typename T>
inline T gpu_dot_product_v(const datastruct<T>& vec1, const datastruct<T>& vec2)
{
    const size_t n = vec1.pextents[0];
    T result = 0;
    #pragma acc loop vector reduction(+:result)
    for (size_t i = 0; i < n; ++i)
    {
        result += vec1(i) * vec2(i);
    }
    return result;
}

That behavior of nvc++ makes no sense.

The same code works when I insert it as plain text into gpu_qr_decomposition, but it crashes when put into a separate function.

What is notable is that this is a dot product of a vector with itself. Perhaps the compiler then optimizes something away, or has problems with multiply-add optimizations?

I highly suspect this is a bug. Whatever it is, it looks like a compiler problem to me.

The problem seems to have nothing to do with the operators: I can copy their content directly into the dot product and the behavior, which I consider erroneous, stays the same. If outsourced into a vector function, the vector loop of the dot product fails despite containing the same code and despite being designated as inline!

Now I know what is going on in my case. I changed my code so that the operators() take the strides as arguments, and I hoist the strides into consts before the loop. This enabled further optimization, and I got useful compiler output.

According to the OpenACC standard, of course, a worker routine can, and is allowed to, call a function containing a worker loop:

The worker clause specifies that the procedure contains, may contain, or may call another procedure that contains a loop with the worker clause, but does not contain nor does it call another procedure that contains a loop with the gang clause.

And nvc++ honors this.

At the end of the QR decomposition, outside of any loop, there is a matrix multiplication; I can designate it as worker inside the worker function, and it just works.

However, inside the QR decomposition there is something else: a loop that has to be processed sequentially. With the strides now removed as dependencies, nvc++ indeed complains when I designate the matrix multiplications inside this sequential loop as workers, because they are inside a sequential loop, and from a sequential context one cannot spawn workers.

However, inside this sequential loop I can apparently write plain vector and parallel loops, and the compiler says these are parallelizable. It is a bit strange that it nevertheless refuses to call worker procedures from there.

Also, the other output is a bit strange. For example, this loop:

const size_t h = c;
const size_t pstrv0 = pstrv[0];
const size_t qstr0 = Q.pstrides[0];
const size_t qstr1 = Q.pstrides[1];
#pragma acc loop vector
for (size_t i = 0; i < n; ++i)
{
    Q(i, h, qstr0, qstr1) = v(i, pstrv0);
}
and with this operator:

#pragma acc routine seq
template<typename T>
__attribute__((always_inline)) inline T& datastruct<T>::operator()(const size_t row, const size_t col, const size_t strides0, const size_t strides1)
{
    return pdata[row * strides0 + col * strides1];
}

and similarly for the vector. nvc++ returns:

, Loop is parallelizable
Loop not vectorized: data dependency

I wrote vector, and thus it should vectorize. There are no data dependencies; the apparent side effects are none, because the strides are const and the pdata fields are distinct.

It often happens that the compiler claims there would be side effects, also with OpenMP code that I know other compilers can vectorize. Since vector loops have no thread-startup overhead, missing vectorization means a 5-10x slowdown.

But well, at least the reason for these crashes is now resolved: the compiler had problems calling routines designated as worker (containing worker and vector loops) from a sequential loop inside a worker function.