Nvc++: undefined __kmpc_for_static_init_16 and Unexpected branch type

Dear all,
I’m continuing my quest to get Bryce Adelstein Lelbach’s Cartesian-product example running.
I implemented it in the following way:

#include <algorithm>
#include <iostream>
#include <ranges>
#include <execution>

#ifndef D_Vector
    #include <valarray>
#else
    #include <vector>
#endif

#include <experimental/mdspan>

namespace stdex = std::experimental;
namespace execution = std::execution;
namespace stdv = std::views;

int main()
{
    constexpr int N = 1000;
    constexpr int M = 1000;
    constexpr int O = 100;

#ifndef D_Vector
    std::valarray<double> input(1, N * M * O);
    std::valarray<double> output(N * M * O);

    stdex::mdspan A{std::begin(input),  N, M, O};
    stdex::mdspan B{std::begin(output), N, M, O};
#else
    std::vector<double> input(N * M * O, 1);
    std::vector<double> output(N * M * O);

    // stdex::mdspan<double, stdex::dextents<2>> A{input.begin(), N, M, O};
    stdex::mdspan A{input.begin(), N, M, O};
    stdex::mdspan B{output.begin(), N, M, O};
#endif

    A(1, 1, 1) = 2;

    auto v = stdv::cartesian_product(
        std::ranges::views::iota(1ul, A.extent(0) - 1),
        std::ranges::views::iota(1ul, A.extent(1) - 1),
        std::ranges::views::iota(1ul, A.extent(2) - 1));

    std::for_each(execution::par_unseq,
                  std::begin(v),
                  std::end(v),
                  [=] (auto idx)
                  {
                      auto [i, j, k] = idx;
                      B(i, j, k) = (A(i, j, k-1) + A(i, j, k+1)
                                  + A(i-1, j, k) + A(i+1, j, k)
                                  + A(i, j-1, k) + A(i, j, k) + A(i, j+1, k)) / 7;
                  });

    std::cout << B(1, 1, 1) << ' ' << A(1, 1, 1) << std::endl;
}

Compiling with (note: I’m using nvc++ from HPC SDK 24.3 and a conda-installed g++ 13.2)

nvc++ cp1.C -O2 --std=c++23  --gcc-toolchain=${CONDA_PREFIX}/bin/gcc -L ${CONDA_PREFIX}/x86_64-conda-linux-gnu/lib -o cp1 -L  ${CONDA_PREFIX}/lib   && ./cp1

everything looks fine. It compiles and runs.

Adding -stdpar=multicore as compile flag results in

/usr/bin/ld: /tmp/nvc++dd_6jmN0pH6I.o: in function `__nv__ZN6thrust6system3omp6detail10for_each_nINS2_5par_tENSt6ranges22cartesian_product_viewINS5_9iota_viewImmEEJS8_S8_EE9_IteratorILb1EEEnZ4mainEUlT_E_EET0_RNS2_16execution_policyISC_EESE_T1_T2__F786L74_1':
${HOME}/NVidia/hpc_sdk/Linux_x86_64/24.3/compilers/include-stdpar/thrust/system/omp/detail/for_each.inl:74: undefined reference to `__kmpc_for_static_init_16'

where I replaced my home directory with $HOME. Any hints on which library I need to link? I scanned all installed libraries but couldn’t find a suitable one.

In addition, when using -stdpar=gpu, compilation aborts with

NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected branch type (cp1.C: 432)
NVC++/x86-64 Linux 24.3-0: compilation aborted

And finally, when trying to use std::vector with -DD_Vector, I’m flooded with error messages, starting with

"cp1.C", line 35: error: cannot deduce class template arguments
      stdex::mdspan A{input.begin(), N, M, O};

Any help is appreciated,
best regards,
Peter

Hi Peter,

I talked with engineering and these are known limitations. The issue is with “iota” when passing 64-bit ints. Under the hood “iota” changes these to 128-bit ints, which we don’t handle well.

The workaround is to use 32-bit ints instead, i.e. change “1ul” to “1u” for iota’s bounds.

Hope this helps,
Mat

Dear Mat,
thanks a lot. The multicore and GPU versions now compile. I had added 1ul because I had compilation issues without it, but I didn’t try just 1u instead of 1.
The multicore version now runs fine.
The gpu version crashes:

mdspan: nvc++ cp1.C -O2  --std=c++23  -stdpar=gpu  --gcc-toolchain=${CONDA_PREFIX}/bin/gcc -L ${CONDA_PREFIX}/x86_64-conda-linux-gnu/lib -o cp1 -L  ${CONDA_PREFIX}/lib   && time ./cp1
/usr/bin/ld: warning: /tmp/pgcudafatYWWibwmTUHLNr.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
terminate called after throwing an instance of 'thrust::THRUST_200300_86_NS::system::system_error'
  what():  parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

I have a Core i7-13700H notebook with a RTX A1000 6GB Laptop GPU.
Using rather small dimensions didn’t help either.

And the version with -DD_Vector still doesn’t compile.

All the best,
Peter

Engineering helped me again, since this is a new area for me.

The problem with the GPU version is that the GNU STL cartesian_product iterators hold pointers back to the cartesian_product_view object, which lives on the stack and can’t be accessed by GPU code. The workaround (see below) is to heap-allocate the cartesian_product_view so it is placed in CUDA Unified Memory.

For the “D_Vector” version, this is an error in your code. The deduction guides for mdspan require that the data pointer be an actual pointer. input.begin(), when input is a std::vector, is an iterator, not a pointer, so the mdspan deduction guide doesn’t kick in and the compiler can’t deduce the template arguments for mdspan. To fix, change input.begin() and output.begin() to input.data() and output.data().

Here’s the working version:

#include <algorithm>
#include <iostream>
#include <ranges>
#include <execution>

#ifndef D_Vector
    #include <valarray>
#else
    #include <vector>
#endif

#include <experimental/mdspan>

namespace stdex = std::experimental;
namespace execution = std::execution;
namespace stdv = std::views;

int main()
{
    constexpr int N = 100;
    constexpr int M = 100;
    constexpr int O = 100;

#ifndef D_Vector
    std::valarray<double> input(1, N * M * O);
    std::valarray<double> output(N * M * O);

    stdex::mdspan A{std::begin(input),  N, M, O};
    stdex::mdspan B{std::begin(output), N, M, O};
#else
    std::vector<double> input(N * M * O, 1);
    std::vector<double> output(N * M * O);

    // stdex::mdspan<double, stdex::dextents<2>> A{input.begin(), N, M, O};
    stdex::mdspan A{input.data(), N, M, O};
    stdex::mdspan B{output.data(), N, M, O};
#endif

    A(1, 1, 1) = 2;

    auto *vp = new std::ranges::cartesian_product_view(
        std::ranges::views::iota(1u, A.extent(0) - 1),
        std::ranges::views::iota(1u, A.extent(1) - 1),
        std::ranges::views::iota(1u, A.extent(2) - 1));
    auto& v = *vp;

    std::for_each(execution::par_unseq,
                  std::begin(v),
                  std::end(v),
                  [=] (auto idx)
                  {
                      auto [i, j, k] = idx;
                      B(i, j, k) = (A(i, j, k-1) + A(i, j, k+1)
                                  + A(i-1, j, k) + A(i+1, j, k)
                                  + A(i, j-1, k) + A(i, j, k) + A(i, j+1, k)) / 7;
                  });

    std::cout << B(1, 1, 1) << ' ' << A(1, 1, 1) << std::endl;
}
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=multicore test.cpp ; a.out
1.14286 2
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=gpu test.cpp ; a.out
1.14286 2
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=gpu -DD_Vector test.cpp ; a.out
1.14286 2

Dear Mat,
indeed this works fine, thanks a lot.

I did some benchmarking and realized that the multicore version is significantly faster, so I guess the gpu version is slowed down by the data transfer.

As a side note, the valarray version is created and filled slightly faster than the std::vector version.

Best regards,
Peter

so I guess the gpu version is slowed down by the data transfer.

Looking at the nsys profile, it’s mostly due to the allocation of managed memory. Not unexpected since the code has little compute or reuse of the data so the overhead dominates.


 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)           Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  ----------------------
     71.3       20,867,129          1  20,867,129.0  20,867,129.0  20,867,129  20,867,129          0.0  cuMemAllocManaged
     23.4        6,831,518          1   6,831,518.0   6,831,518.0   6,831,518   6,831,518          0.0  cudaStreamSynchronize
      4.1        1,193,802          1   1,193,802.0   1,193,802.0   1,193,802   1,193,802          0.0  cuMemAllocHost_v2
      0.5          140,270        409         343.0         290.0          90      11,100        557.6  cuGetProcAddress_v2
      0.4          110,511          1     110,511.0     110,511.0     110,511     110,511          0.0  cuMemAlloc_v2
      0.3          100,231          1     100,231.0     100,231.0     100,231     100,231          0.0  cudaLaunchKernel
      0.0            1,840          4         460.0         305.0         130       1,100        436.1  cuCtxSetCurrent
      0.0            1,360          1       1,360.0       1,360.0       1,360       1,360          0.0  cuInit
      0.0              130          1         130.0         130.0         130         130          0.0  cuModuleGetLoadingMode

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_kern_sum.py]...

 ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name                                      
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
    100.0        6,834,133          1  6,834,133.0  6,834,133.0  6,834,133  6,834,133          0.0  void thrust::THRUST_200300_90_NS::cuda_cub::core::_kernel_agent<thrust::THRUST_200300_90_NS::cuda_c…

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_time_sum.py]...

 ** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):

 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)               Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ------------------------------------
     99.2        4,258,635    562   7,577.6   4,159.0     2,911   193,474     15,723.5  [CUDA memcpy Unified Host-to-Device]
      0.8           35,196      8   4,399.5   1,535.5     1,023    23,809      7,869.6  [CUDA memcpy Unified Device-to-Host]

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_size_sum.py]...

 ** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)               Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------------
     16.974    562     0.030     0.008     0.004     1.004        0.093  [CUDA memcpy Unified Host-to-Device]
      0.262      8     0.033     0.016     0.004     0.106        0.037  [CUDA memcpy Unified Device-to-Host]

Thanks, makes sense to me.
