Nvc++: undefined __kmpc_for_static_init_16 and Unexpected branch type

Dear all,
I’m continuing my quest to get Bryce Adelstein Lelbach’s Cartesian-product example running.
I implemented it in the following way:

#include <algorithm>
#include <iostream>
#include <ranges>
#include <execution>

#ifndef D_Vector
    #include <valarray>
#else
    #include <vector>
#endif

#include <experimental/mdspan>

namespace stdex = std::experimental;
namespace execution = std::execution;
namespace stdv = std::views;

int main()
{
    constexpr int N = 1000;
    constexpr int M = 1000;
    constexpr int O = 100;

#ifndef D_Vector
    std::valarray<double> input(1, N * M * O);
    std::valarray<double> output(N * M * O);

    stdex::mdspan A{std::begin(input),  N, M, O};
    stdex::mdspan B{std::begin(output), N, M, O};
#else
    std::vector<double> input(N * M * O, 1);
    std::vector<double> output(N * M * O);

    // stdex::mdspan<double, stdex::dextents<2>> A{input.begin(), N, M, O};
    stdex::mdspan A{input.begin(), N, M, O};
    stdex::mdspan B{output.begin(), N, M, O};
#endif

    A(1, 1, 1) = 2;

    auto v = stdv::cartesian_product(
        std::ranges::views::iota(1ul, A.extent(0) - 1),
        std::ranges::views::iota(1ul, A.extent(1) - 1),
        std::ranges::views::iota(1ul, A.extent(2) - 1));

    std::for_each(execution::par_unseq,
                  std::begin(v),
                  std::end(v),
                  [=] (auto idx)
                  {
                      auto [i, j, k] = idx;
                      B(i, j, k) = (A(i, j, k-1) + A(i, j, k+1)
                                  + A(i-1, j, k) + A(i+1, j, k)
                                  + A(i, j-1, k) + A(i, j, k) + A(i, j+1, k)) / 7;
                  });

    std::cout << B(1, 1, 1) << ' ' << A(1, 1, 1) << std::endl;
}

Compiling with (note: I’m using nvc++ from HPC SDK 24.3 and a conda-installed g++ 13.2)

nvc++ cp1.C -O2 --std=c++23  --gcc-toolchain=${CONDA_PREFIX}/bin/gcc -L ${CONDA_PREFIX}/x86_64-conda-linux-gnu/lib -o cp1 -L  ${CONDA_PREFIX}/lib   && ./cp1

everything looks fine. It compiles and runs.

Adding -stdpar=multicore as compile flag results in

/usr/bin/ld: /tmp/nvc++dd_6jmN0pH6I.o: in function `__nv__ZN6thrust6system3omp6detail10for_each_nINS2_5par_tENSt6ranges22cartesian_product_viewINS5_9iota_viewImmEEJS8_S8_EE9_IteratorILb1EEEnZ4mainEUlT_E_EET0_RNS2_16execution_policyISC_EESE_T1_T2__F786L74_1':
${HOME}/NVidia/hpc_sdk/Linux_x86_64/24.3/compilers/include-stdpar/thrust/system/omp/detail/for_each.inl:74: undefined reference to `__kmpc_for_static_init_16'

where I replaced my home directory with $HOME. Any hints on which library I need to link? I scanned all installed libraries but couldn’t find a suitable one.

In addition, when using -stdpar=gpu, compilation aborts with

NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected branch type (cp1.C: 432)
NVC++/x86-64 Linux 24.3-0: compilation aborted

And finally, when trying to use std::vector with -DD_Vector, I’m flooded with error messages, starting with

"cp1.C", line 35: error: cannot deduce class template arguments
      stdex::mdspan A{input.begin(), N, M, O};

Any help is appreciated,
best regards,
Peter

Hi Peter,

I talked with engineering and these are known limitations. The issue is with “iota” when passing 64-bit ints. Under the hood “iota” changes these to 128-bit ints, which we don’t handle well.

The workaround is to use 32-bit ints instead, i.e. change “1ul” to “1u” for iota’s bounds.

Hope this helps,
Mat

Dear Mat,
thanks a lot. The multicore and GPU versions now compile. I had added 1ul because I had compilation issues without it, but I didn’t try just 1u instead of 1.
The multicore version now runs fine.
The gpu version crashes:

mdspan: nvc++ cp1.C -O2  --std=c++23  -stdpar=gpu  --gcc-toolchain=${CONDA_PREFIX}/bin/gcc -L ${CONDA_PREFIX}/x86_64-conda-linux-gnu/lib -o cp1 -L  ${CONDA_PREFIX}/lib   && time ./cp1
/usr/bin/ld: warning: /tmp/pgcudafatYWWibwmTUHLNr.o: missing .note.GNU-stack section implies executable stack
/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
terminate called after throwing an instance of 'thrust::THRUST_200300_86_NS::system::system_error'
  what():  parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

I have a Core i7-13700H notebook with a RTX A1000 6GB Laptop GPU.
Using rather small dimensions didn’t help either.

And the version with -DD_Vector still doesn’t compile.

All the best,
Peter

Engineering helped me again, since this is a new area for me.

The problem with the GPU version is that the GNU STL cartesian_product iterators hold pointers back to the cartesian_product_view object, which lives on the stack and can’t be accessed by GPU code. The workaround (see below) is to heap-allocate the cartesian_product_view so it is placed in CUDA Unified Memory.

For the “D_Vector” version, this is an error in your code. The deduction guides for mdspan require that the data pointer be an actual pointer. input.begin(), when input is a std::vector, is an iterator, not a pointer, so the mdspan deduction guide doesn’t kick in and the compiler can’t deduce the template arguments for mdspan. To fix, change input.begin() and output.begin() to input.data() and output.data().

Here’s the working version:

#include <algorithm>
#include <iostream>
#include <ranges>
#include <execution>

#ifndef D_Vector
    #include <valarray>
#else
    #include <vector>
#endif

#include <experimental/mdspan>

namespace stdex = std::experimental;
namespace execution = std::execution;
namespace stdv = std::views;

int main()
{
    constexpr int N = 100;
    constexpr int M = 100;
    constexpr int O = 100;

#ifndef D_Vector
    std::valarray<double> input(1, N * M * O);
    std::valarray<double> output(N * M * O);

    stdex::mdspan A{std::begin(input),  N, M, O};
    stdex::mdspan B{std::begin(output), N, M, O};
#else
    std::vector<double> input(N * M * O, 1);
    std::vector<double> output(N * M * O);

    // stdex::mdspan<double, stdex::dextents<2>> A{input.begin(), N, M, O};
    stdex::mdspan A{input.data(), N, M, O};
    stdex::mdspan B{output.data(), N, M, O};
#endif

    A(1, 1, 1) = 2;

    auto *vp = new std::ranges::cartesian_product_view(
        std::ranges::views::iota(1u, A.extent(0) - 1),
        std::ranges::views::iota(1u, A.extent(1) - 1),
        std::ranges::views::iota(1u, A.extent(2) - 1));
    auto& v = *vp;

    std::for_each(execution::par_unseq,
                  std::begin(v),
                  std::end(v),
                  [=] (auto idx)
                  {
                      auto [i, j, k] = idx;
                      B(i, j, k) = (A(i, j, k-1) + A(i, j, k+1)
                                  + A(i-1, j, k) + A(i+1, j, k)
                                  + A(i, j-1, k) + A(i, j, k) + A(i, j+1, k)) / 7;
                  });

    std::cout << B(1, 1, 1) << ' ' << A(1, 1, 1) << std::endl;
}
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=multicore test.cpp ; a.out
1.14286 2
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=gpu test.cpp ; a.out
1.14286 2
% nvc++ --std=c++23 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.2.0/Linux_x86_64 -stdpar=gpu -DD_Vector test.cpp ; a.out
1.14286 2

Dear Mat,
indeed this works fine, thanks a lot.

I did some benchmarking and realized that the multicore version is significantly faster, so I guess the gpu version is slowed down by the data transfer.

As a side note, the valarray version is created and filled slightly faster than the std::vector version.

Best regards,
Peter

so I guess the gpu version is slowed down by the data transfer.

Looking at the nsys profile, it’s mostly due to the allocation of managed memory. Not unexpected since the code has little compute or reuse of the data so the overhead dominates.


 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)           Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  ----------------------
     71.3       20,867,129          1  20,867,129.0  20,867,129.0  20,867,129  20,867,129          0.0  cuMemAllocManaged
     23.4        6,831,518          1   6,831,518.0   6,831,518.0   6,831,518   6,831,518          0.0  cudaStreamSynchronize
      4.1        1,193,802          1   1,193,802.0   1,193,802.0   1,193,802   1,193,802          0.0  cuMemAllocHost_v2
      0.5          140,270        409         343.0         290.0          90      11,100        557.6  cuGetProcAddress_v2
      0.4          110,511          1     110,511.0     110,511.0     110,511     110,511          0.0  cuMemAlloc_v2
      0.3          100,231          1     100,231.0     100,231.0     100,231     100,231          0.0  cudaLaunchKernel
      0.0            1,840          4         460.0         305.0         130       1,100        436.1  cuCtxSetCurrent
      0.0            1,360          1       1,360.0       1,360.0       1,360       1,360          0.0  cuInit
      0.0              130          1         130.0         130.0         130         130          0.0  cuModuleGetLoadingMode

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_kern_sum.py]...

 ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name                                      
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
    100.0        6,834,133          1  6,834,133.0  6,834,133.0  6,834,133  6,834,133          0.0  void thrust::THRUST_200300_90_NS::cuda_cub::core::_kernel_agent<thrust::THRUST_200300_90_NS::cuda_c…

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_time_sum.py]...

 ** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):

 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)               Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ------------------------------------
     99.2        4,258,635    562   7,577.6   4,159.0     2,911   193,474     15,723.5  [CUDA memcpy Unified Host-to-Device]
      0.8           35,196      8   4,399.5   1,535.5     1,023    23,809      7,869.6  [CUDA memcpy Unified Device-to-Host]

Processing [report6.sqlite] with [/proj/nv/Linux_x86_64/249847-dev/profilers/Nsight_Systems/host-linux-x64/reports/cuda_gpu_mem_size_sum.py]...

 ** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)               Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------------
     16.974    562     0.030     0.008     0.004     1.004        0.093  [CUDA memcpy Unified Host-to-Device]
      0.262      8     0.033     0.016     0.004     0.106        0.037  [CUDA memcpy Unified Device-to-Host]

Thanks, makes sense to me.
