LLVM Error when compiling C++ STD parallel execution policies to GPU

Hello everyone,

I am trying to compile C++ code that uses parallel execution policies for an NVIDIA GPU. However, I am getting the following error:

main:
    141, stdpar: Generating NVIDIA GPU code
        141, std::transform_reduce with std::execution::par policy parallelized on GPU
__gnu_cxx::__enable_if<std::__is_scalar<int>::__value, void>::__type std::__fill_a1<int*, int>(int*, int*, int const&):
    930, Recognized memory set idiom
__gnu_cxx::__enable_if<std::__is_scalar<double>::__value, void>::__type std::__fill_a1<double*, double>(double*, double*, double const&):
    930, Recognized memory set idiom
Intrinsic has incorrect return type!
%struct.DT1_256705 (i64, i64)* @llvm.umul.with.overflow.i64
Intrinsic has incorrect return type!
%struct.DT1_256705 (i64, i64)* @llvm.umul.with.overflow.i64
NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (ep.cpp: 659)
NVC++/x86-64 Linux 24.3-0: compilation aborted

This code compiles and executes correctly with g++.

Would appreciate any help I can get.

nvc++ version 24.3-0, CUDA 12.4, g++ 13.1

Hi renatowbhf,

Can you post a reproducing example?

It’s a somewhat generic error, so I can’t quite tell if it’s a compiler issue or some incompatibility with the g++ 13.1 STL. Having a reproducer will help determine the issue, and I can then file a report if it’s a compiler issue.

Thanks,
Mat


Hello MatColgrove,

I am attaching PROBLEM.zip containing the necessary code to reproduce the error from this post.

PROBLEM.zip (32.0 KB)

There is a simple Makefile, in which the gpp rule works fine while the nvidia (nvc++) rule doesn’t.

Thanks,
Renato

Thanks Renato. Though for better or worse, I’m not able to reproduce the error on my system. While I see linker errors due to missing device routines, the compilation itself is fine.

What type of system are you using? What OS? Is your GNU 13.1 the system compiler or something you installed?

I might be able to find a system that’s similar enough to yours and try again in case the issue is system specific.

-Mat

% nvc++ -std=c++23 -stdpar=gpu common/c_print_results.cpp common/c_randdp.cpp common/wtime.cpp common/c_timers.cpp problem.cpp -o nvidia -gpu=cuda12.4 -V24.3 --gcc-toolchain=/home/sw/thirdparty/gcc/gcc-13.1.0/Linux_x86_64/ -Minfo
common/c_print_results.cpp:
common/c_randdp.cpp:
randlc(double&, double):
     91, FMA (fused multiply-add) instruction(s) generated
    102, FMA (fused multiply-add) instruction(s) generated
    103, FMA (fused multiply-add) instruction(s) generated
    105, FMA (fused multiply-add) instruction(s) generated
    106, FMA (fused multiply-add) instruction(s) generated
    108, FMA (fused multiply-add) instruction(s) generated
vranlc(int, double&, double, double*):
    149, FMA (fused multiply-add) instruction(s) generated
    167, FMA (fused multiply-add) instruction(s) generated
    168, FMA (fused multiply-add) instruction(s) generated
    170, FMA (fused multiply-add) instruction(s) generated
    171, FMA (fused multiply-add) instruction(s) generated
    173, FMA (fused multiply-add) instruction(s) generated
common/wtime.cpp:
common/c_timers.cpp:
__gnu_cxx::__enable_if<std::__is_scalar<double>::__value, void>::__type std::__fill_a1<double*, double>(double*, double*, double const&):
    930, Recognized memory set idiom
problem.cpp:
main:
     83, stdpar: Generating NVIDIA GPU code
         83, std::transform_reduce with std::execution::par policy parallelized on GPU
__gnu_cxx::__enable_if<std::__is_scalar<int>::__value, void>::__type std::__fill_a1<int*, int>(int*, int*, int const&):
    930, Recognized memory set idiom
__gnu_cxx::__enable_if<std::__is_scalar<double>::__value, void>::__type std::__fill_a1<double*, double>(double*, double*, double const&):
    930, Recognized memory set idiom
nvlink error   : Undefined reference to '_ZSt20__throw_length_errorPKc' in '/tmp/nvc++N-7Dm1SUMrCjd.o'
nvlink error   : Undefined reference to '_Z6randlcRdd' in '/tmp/nvc++N-7Dm1SUMrCjd.o'
nvlink error   : Undefined reference to '_Z10timer_stopi' in '/tmp/nvc++N-7Dm1SUMrCjd.o'
nvlink error   : Undefined reference to '_Z6vranlciRddPd' in '/tmp/nvc++N-7Dm1SUMrCjd.o'
nvlink error   : Undefined reference to '_Z11timer_starti' in '/tmp/nvc++N-7Dm1SUMrCjd.o'
pgacclnk: child process exit status 2: /proj/nv/Linux_x86_64/24.3/compilers/bin/tools/nvdd

Hello Mat,

Thanks again for your help.

So I tested on a different system and got the same output as you posted here.

System 1 (where things are working):

Ubuntu 22, Kernel 6.5.0-27
default g++ is 11 (localrc is g++-11)
CUDA version 12.3
NVIDIA driver 545.23.08

System 2 (where things are NOT working):

Ubuntu Server 20.04.6, Kernel 5.4.0-176
default g++ is 9 but compiled with g++-13 (localrc is g++-13)
CUDA version 12.4 (but NVC++ is using 12.3)
NVIDIA driver 550.54.15

So I tested System 1 with localrc set to g++-13, and the same issue showed up.
In conclusion: g++-11 works, g++-12 was not tested, and g++-13 does NOT work.

Another question: what is the correct way of linking things using NVC++?

I just can’t seem to get rid of the linking errors, even though all of those functions (whose names are mangled by NVC++) are in the common/ directory and are included in the compilation path. Things work fine with g++.
I checked the symbols in the object files generated by NVC++; all the functions are there.

Best,
Renato

Ok, so it seems to be a g++ 13 compatibility issue. GCC 13 changed a lot of things, which caused us issues when it first came out, but those have largely been resolved. Unfortunately, I’m not sure what’s different about your build from mine that causes this error.

Let me ask engineering if they might have any ideas. Though you may need to use the 11 or 12 STL for now.

I just can’t seem to get rid of the linking errors, even though all of those functions (whose names are mangled by NVC++) are in the common/ directory and are included in the compilation path. Things work fine with g++.
I checked the symbols in the object files generated by NVC++; all the functions are there.

These are device linker errors. The symbols you’re seeing are likely the host versions, not the device ones. Also, g++ doesn’t target the device, which is why it works there.

In order to call routines from offloaded STDPAR code, there needs to be a device version available. The compiler will implicitly create these for you, but only if the definition is available in the same compilation unit. Since these are in separate files, it can’t. There are a few things we can try.

  1. You can decorate the called routines with the OpenACC “#pragma acc routine” directive just before each routine’s prototype. If these routines in turn call other routines, then you need to add it to those as well.

I thought I saw some global variables as well. If so, these need to be decorated with “#pragma acc declare create” and then a “#pragma acc update device(varname)” after each is initialized, so the device has the same value as the host. If the global variable is statically initialized, then you can just use “#pragma acc declare copyin(varname)”.

  2. You can try inlining the routines. Since you have all the source on the same compile line, you can try adding “-Minline”. If the real project compiles these separately, you’ll need to do a two-pass compile: first compile all files with “-Mextract=lib:” to create the inline library, then recompile with “-Minline=lib:” to use the library.

If there is a deep call graph, then you might need to increase the number of levels via “-Minline=levels:10” (i.e. up to 10 levels of calls).

Note that there’s no exception handling on the device so you’ll need to find where the throw is coming from and see if you can remove it.
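A minimal sketch of option 1 above. The routine and global names here (`scale`, `g_offset`, `init_offset`) are hypothetical stand-ins, not the actual symbols from the common/ sources; in the real project the pragma would go just before prototypes such as the one for randlc. The pragmas are only meaningful to an OpenACC-aware compiler like nvc++ and are ignored elsewhere.

```cpp
#include <cassert>

// Hypothetical stand-in for a routine like randlc() that lives in a
// separate file (common/) in the real project. The "routine" directive
// tells nvc++ to also generate a device version of it.
#pragma acc routine seq
inline double scale(double x, double a) { return x * a; }

// Hypothetical global that device code also reads. "declare create"
// gives it a device copy...
double g_offset = 0.0;
#pragma acc declare create(g_offset)

// ...and after host-side initialization, "update device" mirrors the
// host value to the device copy.
inline void init_offset(double v) {
    g_offset = v;
    #pragma acc update device(g_offset)
}
```

Without the `routine` directive, the device compiler has no device version of `scale` to link against, which is what produces the “Undefined reference” nvlink errors shown earlier.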


Oh I see. It makes sense that NVC++ only compiles to the GPU what it assumes to be ‘kernel’ code.

Your solution with -Minline solved the issue for me.

One final question:
I know we cannot have C++ STL code inside kernels in regular CUDA.
Can NVC++ handle STL code such as std::vector, std::advance, or std::make_pair?

I know that would require NVC++ to understand that C++ code and transform it into a C-style equivalent, so I am 99% sure most STL code will not work.

I ask this because in my research group we have a CFD benchmark in CUDA, and also a full C++ STL version that compiles and executes nicely with parallel policies targeting multicore.

Now I believe that to test NVC++'s parallel-policy code generation, we would have to fuse those two versions together, mostly getting rid of all C++ STL code besides algorithms such as std::transform_reduce.

Thanks again for your help!

Hi Renato,

Engineering took a look and was able to reproduce the error. It turns out that it only fails with g++ 13 under CUDA 12.3 but is fine with CUDA 12.4. Since 24.3 only ships with CUDA 12.3, this is why you’re seeing the error; my mistake was explicitly using CUDA 12.4. (Note: CUDA 12.4 should be included in the 24.5 HPC SDK.)

I filed a problem report, TPR#35538. Although it seems to be okay with CUDA 12.4, I’d still prefer they investigate in case the error is only masked rather than actually fixed.

The engineer also noted:

The code is not GPU friendly. The lambda that is passed to the parallel transform_reduce captures by reference (which doesn’t work on GPUs without unified memory) and is mutable (which is problematic for any parallel implementation, on CPU or GPU). I don’t think either of those is actually used, so the capture-by-reference (&) and mutable can probably be removed. Inside the lambda body there are uses of std::shared_ptr and std::vector. The use of std::vector is not recommended due to the dynamic memory allocation in device code. I don’t think std::shared_ptr works at all in device code. I don’t see any reason to use std::shared_ptr here, because the object is only accessed from one thread at a time and the lifetime of the object is well known.
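A small sketch of the lambda style the engineer is describing, not the actual code from PROBLEM.zip (the function name `sum_scaled` and its signature are illustrative). The execution policy is omitted here so the sketch compiles anywhere; under nvc++ -stdpar you would pass std::execution::par as the first argument.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Instead of a lambda like [&](...) mutable { ... } — capture by
// reference plus mutable, which is what breaks on the GPU — capture
// the few values the lambda needs by value and keep it non-mutable.
inline double sum_scaled(const std::vector<double>& v, double factor) {
    return std::transform_reduce(
        v.begin(), v.end(),
        0.0,                // initial value of the reduction
        std::plus<>{},      // reduction operator
        [factor](double x) { return factor * x; });  // by-value, const
}
```

Capturing by value means each GPU thread gets its own copy of `factor`, so no host memory has to be reachable by reference from device code.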

Can NVC++ handle STL code such as std::vector, std::advance, or std::make_pair?

Vectors are okay provided you’re only accessing data. Pushes, pops, etc. can trigger dynamic reallocation of memory, which would cause issues (not just on the device; it also can’t be done safely in parallel).
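One way to follow that advice is to size the vector on the host before the parallel region, so offloaded code only reads and writes existing elements. A minimal sketch (the helper name `squares` and the plain loop standing in for a parallel algorithm are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Allocate once on the host; never push_back from device code, since
// push_back can reallocate the underlying storage.
inline std::vector<double> squares(std::size_t n) {
    std::vector<double> out(n);   // single host-side allocation
    // Inside an offloaded algorithm you would index into out.data()
    // rather than grow the vector; a serial loop stands in here.
    double* p = out.data();
    for (std::size_t i = 0; i < n; ++i)
        p[i] = static_cast<double>(i) * static_cast<double>(i);
    return out;
}
```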

I’m not sure about “advance” and “make_pair”. Though “make_pair” might be ok.

There was an effort to get the full STL onto the device, with libcu++ as the intermediary step, but I’m not sure where they’re at on it. Things like system calls, exceptions, and static globals make it difficult. It may be easier with full Unified Memory support, but I’m not sure.

All right, thanks for all the help!
