Separate compilation and linking with nvc++

Hi,

Is it possible for nvc++ to use function objects defined in a different translation unit from the one that calls a C++ standard library algorithm? As a small example, if the contents of the three files main.cpp, squared.cpp and squared.hpp are:

main.cpp:

#include "squared.hpp"
#include <vector>
#include <iostream>
#include <algorithm>
#include <execution>

int main(int argc, char *argv[])
{
  std::vector<int> v(1<<20,7);
  const auto pol = std::execution::par_unseq;
  std::for_each(pol, v.begin(), v.end(), squared{});
  std::cout << v[0] << '\n';
  return 0;
}

squared.cpp:

#include "squared.hpp"

void squared::operator()(int& x) { x = x * x; }

squared.hpp:

#ifndef SQUARED_HPP
#define SQUARED_HPP

struct squared
{
  void operator()(int&);
};

#endif // SQUARED_HPP

… a command such as nvc++ -stdpar -std=c++17 squared.cpp main.cpp will fail to link due to an undefined reference to squared::operator()(int&). I’m using nvc++ from v21.9 of the HPC SDK on Ubuntu 20.10.

Thanks,
Paul

Hi Paul,

The C++ standard doesn’t provide a way to decorate routines to indicate that a device version needs to be created, so we rely on the compiler generating one implicitly. However, the compiler can only discover which routines need device versions within a single translation unit, so across compilation units you instead need to rely on non-standard extensions.

You can either decorate the routine with the CUDA “__host__ __device__” attributes, or the OpenACC “acc routine” pragma.

% cat squared_d.cpp
#include "squared.hpp"

#ifdef _NVHPC_STDPAR_GPU
__host__ __device__
#endif
void squared::operator()(int& x) { x = x * x; }
% cat squared_acc.cpp
#include "squared.hpp"

#pragma acc routine
void squared::operator()(int& x) { x = x * x; }
% nvc++ -fast -stdpar squared_d.cpp main.cpp -V21.9; a.out
squared_d.cpp:
main.cpp:
49
% nvc++ -fast -stdpar -acc -Minfo=accel squared_acc.cpp main.cpp -V21.9 ; a.out
squared_acc.cpp:
squared::operator ()(int &):
      4, Generating acc routine seq
         Generating Tesla code
main.cpp:
49

Hope this helps,
Mat


Thanks Mat, that’s really helpful.

I assume there isn’t a flag to instruct nvc++ to create device versions of all routines it encounters?

Paul

That’s a good question. We have that functionality in OpenACC (-acc=routineseq i.e. compile every routine for the device), but I don’t think we’ve tested it with stdpar.
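
For reference, applying that flag to the original, undecorated sources would look something like the following (a sketch based on the compile lines earlier in this thread; as noted, this is untested with stdpar):

% nvc++ -fast -stdpar -acc=routineseq squared.cpp main.cpp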

I just tried it on your simple example and it looks like the compiler attempts to offload some of the implicitly included Thrust routines. It also looks like it tries to redefine some routines that already have device attributes. It’s a big-hammer approach, so it would likely need refinement before it could be used with stdpar.

Let me ask our C++ folks if it would even be feasible to implement, and if so, I can add a request for enhancement (RFE).


It’s been 2 years, so I was wondering if this guidance (use host device/acc routine) is still up to date. Is there a better way to handle this now?

Hi aklinvex,

Yes, this is still the way I’d recommend handling compilation of device routines whose definitions are in a separate source file.

Another option, if you can’t decorate the routines, is to use cross-file inlining so the call isn’t needed. Cross-file inlining is a bit of a pain, though, since it requires a two-pass compile: first with the “-Mextract=lib:libname” flag across all the sources to create an inline library, and then a second pass with “-Minline=lib:libname” to inline the routines. Not all routines can be inlined, in particular larger ones, so you might still need to fall back to using “acc routine”.
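
Concretely, with the file names from the example at the top of this thread, the two-pass compile might look like this (a sketch; the inline library name “ilib” is arbitrary):

% # pass 1: extract inline information from all sources into a library
% nvc++ -Mextract=lib:ilib squared.cpp main.cpp
% # pass 2: compile normally, inlining routines from that library
% nvc++ -fast -stdpar -Minline=lib:ilib squared.cpp main.cpp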

-Mat