nvc++: undefined __kmpc_for_static_init_16 and Unexpected branch type

Dear all,
I’m continuing my quest to get the Cartesian product example of Bryce Adelstein Lelbach running.
I implemented it in the following way:

```cpp
#include <algorithm>
#include <iostream>
#include <ranges>
#include <execution>

#ifndef D_Vector
#include <valarray>
#else
#include <vector>
#endif

#include <experimental/mdspan>

namespace stdex = std::experimental;
namespace execution = std::execution;
namespace stdv = std::views;

int main()
{
    constexpr int N = 1000;
    constexpr int M = 1000;
    constexpr int O = 100;

#ifndef D_Vector
    std::valarray<double> input(1, N * M * O);
    std::valarray<double> output(N * M * O);

    stdex::mdspan A{std::begin(input),  N, M, O};
    stdex::mdspan B{std::begin(output), N, M, O};
#else
    std::vector<double> input(N * M * O, 1);
    std::vector<double> output(N * M * O);

    // stdex::mdspan<double, stdex::dextents<2>> A{input.begin(), N, M, O};
    stdex::mdspan A{input.begin(), N, M, O};
    stdex::mdspan B{output.begin(), N, M, O};
#endif

    A(1, 1, 1) = 2;

    auto v = stdv::cartesian_product(
        std::ranges::views::iota(1ul, A.extent(0) - 1),
        std::ranges::views::iota(1ul, A.extent(1) - 1),
        std::ranges::views::iota(1ul, A.extent(2) - 1));

    std::for_each(execution::par_unseq,
                  std::begin(v),
                  std::end(v),
                  [=] (auto idx)
                  {
                      auto [i, j, k] = idx;
                      B(i, j, k) = (A(i, j, k-1) + A(i, j, k+1)
                                    + A(i-1, j, k) + A(i+1, j, k)
                                    + A(i, j-1, k) + A(i, j, k) + A(i, j+1, k)) / 7;
                  });

    std::cout << B(1, 1, 1) << ' ' << A(1, 1, 1) << std::endl;
}
```

Compiling with the following (note: I’m using nvc++ from hpc_sdk 24.3 and a conda-installed g++ 13.2)

```sh
nvc++ cp1.C -O2 --std=c++23 --gcc-toolchain=${CONDA_PREFIX}/bin/gcc -L ${CONDA_PREFIX}/x86_64-conda-linux-gnu/lib -o cp1 -L ${CONDA_PREFIX}/lib && ./cp1
```

everything looks fine. It compiles and runs.

Adding `-stdpar=multicore` as compile flag results in

```
/usr/bin/ld: /tmp/nvc++dd_6jmN0pH6I.o: in function `__nv__ZN6thrust6system3omp6detail10for_each_nINS2_5par_tENSt6ranges22cartesian_product_viewINS5_9iota_viewImmEEJS8_S8_EE9_IteratorILb1EEEnZ4mainEUlT_E_EET0_RNS2_16execution_policyISC_EESE_T1_T2__F786L74_1':
${HOME}/NVidia/hpc_sdk/Linux_x86_64/24.3/compilers/include-stdpar/thrust/system/omp/detail/for_each.inl:74: undefined reference to `__kmpc_for_static_init_16'
```

where I replaced my home directory with `$HOME`. Any hints as to which library I need to link? I scanned all installed libraries but couldn’t find a suitable one.

In addition, when using `-stdpar=gpu`, compilation aborts with

```
NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected branch type (cp1.C: 432)
NVC++/x86-64 Linux 24.3-0: compilation aborted
```

And finally, when trying to use `std::vector` with `-DD_Vector`, I’m flooded with error messages, starting with

```
"cp1.C", line 35: error: cannot deduce class template arguments
  stdex::mdspan A{input.begin(), N, M, O};
```

Any help is appreciated,
best regards,
Peter

Hi Peter,

I talked with engineering and these are known limitations. The issue is with `iota` when passing 64-bit ints. Under the hood, `iota` widens these to 128-bit ints, which we don’t handle well.

The workaround is to use 32-bit ints instead, i.e. change `1ul` to `1u` as iota’s value.

Hope this helps,
Mat

Dear Mat,
thanks a lot. The multicore and GPU versions now do compile. I had added `1ul` because I had compilation issues without it, but I hadn’t tried just `1u` instead of `1`.
The multicore version now runs fine.
The GPU version crashes:

```
mdspan: nvc++ cp1.C -O2 --std=c++23 -stdpar=gpu --gcc-toolchain=${CONDA_PREFIX}/bin/gcc -L ${CONDA_PREFIX}/x86_64-conda-linux-gnu/lib -o cp1 -L ${CONDA_PREFIX}/lib && time ./cp1
/usr/bin/ld: warning: /tmp/pgcudafatYWWibwmTUHLNr.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
terminate called after throwing an instance of 'thrust::THRUST_200300_86_NS::system::system_error'
  what():  parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)
```

I have a Core i7-13700H notebook with an RTX A1000 6GB Laptop GPU.
Using rather small dimensions didn’t help.

And the version with `-DD_Vector` still doesn’t compile.

All the best,
Peter

Engineering helped me again, since this is a new area for me.

The problem with the GPU version is that the GNU STL’s cartesian_product iterators hold pointers to the cartesian_product_view object, which lives on the stack and can’t be accessed by GPU code. The workaround (see below) is to heap-allocate the cartesian_product_view so it is placed in CUDA Unified Memory.

For the “D_Vector” version, this is an error in your code. The deduction guides for mdspan require that the data argument be an actual pointer. `input.begin()`, when `input` is a std::vector, is an iterator, not a pointer, so the mdspan deduction guide doesn’t kick in and the compiler can’t deduce the template arguments for mdspan. To fix this, change `input.begin()` and `output.begin()` to `input.data()` and `output.data()`.

Here’s the working version:

```cpp
#include <algorithm>
#include <iostream>
#include <ranges>
#include <execution>

#ifndef D_Vector
#include <valarray>
#else
#include <vector>
#endif

#include <experimental/mdspan>

namespace stdex = std::experimental;
namespace execution = std::execution;
namespace stdv = std::views;

int main()
{
    constexpr int N = 100;
    constexpr int M = 100;
    constexpr int O = 100;

#ifndef D_Vector
    std::valarray<double> input(1, N * M * O);
    std::valarray<double> output(N * M * O);

    stdex::mdspan A{std::begin(input),  N, M, O};
    stdex::mdspan B{std::begin(output), N, M, O};
#else
    std::vector<double> input(N * M * O, 1);
    std::vector<double> output(N * M * O);

    // stdex::mdspan<double, stdex::dextents<2>> A{input.begin(), N, M, O};
    stdex::mdspan A{input.data(), N, M, O};
    stdex::mdspan B{output.data(), N, M, O};
#endif

    A(1, 1, 1) = 2;

    auto *vp = new std::ranges::cartesian_product_view(
        std::ranges::views::iota(1u, A.extent(0) - 1),
        std::ranges::views::iota(1u, A.extent(1) - 1),
        std::ranges::views::iota(1u, A.extent(2) - 1));
    auto& v = *vp;

    std::for_each(execution::par_unseq,
                  std::begin(v),
                  std::end(v),
                  [=] (auto idx)
                  {
                      auto [i, j, k] = idx;
                      B(i, j, k) = (A(i, j, k-1) + A(i, j, k+1)
                                    + A(i-1, j, k) + A(i+1, j, k)
                                    + A(i, j-1, k) + A(i, j, k) + A(i, j+1, k)) / 7;
                  });

    std::cout << B(1, 1, 1) << ' ' << A(1, 1, 1) << std::endl;
}
```
```
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=multicore test.cpp ; a.out
1.14286 2
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=gpu test.cpp ; a.out
1.14286 2
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=gpu -DD_Vector test.cpp ; a.out
1.14286 2
```

Dear Mat,
indeed this works fine, thanks a lot.

I did some benchmarking and realized that the multicore version is significantly faster, so I guess the GPU version is slowed down by the data transfer.

As a side note, the valarray is created and filled slightly faster than the std::vector version.

Best regards,
Peter

> so I guess the GPU version is slowed down by the data transfer.

Looking at the nsys profile, it’s mostly due to the allocation of managed memory. That’s not unexpected, since the code has little compute and no reuse of the data, so the overhead dominates.

```
** CUDA API Summary (cuda_api_sum):

Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)           Name
--------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  ----------------------
71.3       20,867,129          1  20,867,129.0  20,867,129.0  20,867,129  20,867,129          0.0  cuMemAllocManaged
23.4        6,831,518          1   6,831,518.0   6,831,518.0   6,831,518   6,831,518          0.0  cudaStreamSynchronize
4.1        1,193,802          1   1,193,802.0   1,193,802.0   1,193,802   1,193,802          0.0  cuMemAllocHost_v2
0.5          140,270        409         343.0         290.0          90      11,100        557.6  cuGetProcAddress_v2
0.4          110,511          1     110,511.0     110,511.0     110,511     110,511          0.0  cuMemAlloc_v2
0.3          100,231          1     100,231.0     100,231.0     100,231     100,231          0.0  cudaLaunchKernel
0.0            1,840          4         460.0         305.0         130       1,100        436.1  cuCtxSetCurrent
0.0            1,360          1       1,360.0       1,360.0       1,360       1,360          0.0  cuInit

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_kern_sum.py]...

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
--------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
100.0        6,834,133          1  6,834,133.0  6,834,133.0  6,834,133  6,834,133          0.0  void thrust::THRUST_200300_90_NS::cuda_cub::core::_kernel_agent<thrust::THRUST_200300_90_NS::cuda_c…

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_time_sum.py]...

** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):

Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)               Operation
--------  ---------------  -----  --------  --------  --------  --------  -----------  ------------------------------------
99.2        4,258,635    562   7,577.6   4,159.0     2,911   193,474     15,723.5  [CUDA memcpy Unified Host-to-Device]
0.8           35,196      8   4,399.5   1,535.5     1,023    23,809      7,869.6  [CUDA memcpy Unified Device-to-Host]

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_size_sum.py]...

** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)               Operation
----------  -----  --------  --------  --------  --------  -----------  ------------------------------------
16.974    562     0.030     0.008     0.004     1.004        0.093  [CUDA memcpy Unified Host-to-Device]
0.262      8     0.033     0.016     0.004     0.106        0.037  [CUDA memcpy Unified Device-to-Host]
```

Thanks, makes sense to me.
