nvc++ -stdpar functionality possible without single compilation unit? Host linker?

I am attempting to leverage the nvc++ compiler to produce GPU-accelerated C++ Standard Library calls within a mex file for Matlab. To start with, I built a minimal toy example purely in C++ (no mex/Matlab) that has two modules: one that serves as a main driver, and one that holds the actual processing code. The main driver does memory allocation, then calls the processing code, then exits. Here it is:

test_standalone.cpp: (main driver)

#include <cstdlib>  // malloc
#include <iostream>
#include <string>   // std::stof
#include <vector>

size_t VEC_NUM_ELEM = 10;

void test_execute(float *array_in, float *array_out, const size_t vec_size, const float coeff); // processing code

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        std::cout << "Try: " << argv[0] << "<coeff> " << std::endl;
        return -1;
    }

    const float coeff = std::stof(argv[1]);

    std::cout << "Coeff: " << coeff << std::endl;

    float *vec1_array = (float *)malloc(VEC_NUM_ELEM * sizeof(float));
    float *vec2_array = (float *)malloc(VEC_NUM_ELEM * sizeof(float));

    for (unsigned i = 0; i < VEC_NUM_ELEM; i++)
    {
        vec1_array[i] = static_cast<float>(i);
    }

    test_execute(vec1_array, vec2_array, VEC_NUM_ELEM, coeff);

    return 0;
} // end main driver

test_execute.cpp: (“processing”/CPU/GPU code):

#include <algorithm> // std::transform
#include <execution> // std::execution::par_unseq
#include <iostream>
#include <cassert>

#include "cuda_runtime.h"

#define assertm(exp, msg) assert(((void)msg, exp))

void test_execute(float *array_in, float *array_out, const size_t vec_size, const float coeff)
{
    cudaPointerAttributes attrib_in;
    cudaError_t cerr = cudaPointerGetAttributes(&attrib_in, array_in);
    assertm(cerr == cudaSuccess, "cudaPointerGetAttributes() failed!");

    std::cout << "array_in memory type: " << attrib_in.type << std::endl;

    std::transform
    (
        std::execution::par_unseq,
        array_in,
        array_in + vec_size,
        array_out,
        [coeff](float a)
        {
            float ret = a + coeff;
            return ret;
        }
    );
    
} // end processing code test_execute.cpp
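
As an aside, the numeric value printed for attrib_in.type maps onto the CUDA runtime’s cudaMemoryType enum (0 = unregistered, 1 = host, 2 = device, 3 = managed). A tiny helper like the sketch below, which is not part of my actual build (memory_type_name is just an illustrative name), makes the output readable:

// Sketch only -- not part of the build above. Decodes cudaPointerAttributes::type.
#include "cuda_runtime.h"

const char *memory_type_name(cudaMemoryType t)
{
    switch (t)
    {
        case cudaMemoryTypeUnregistered: return "host unregistered"; // plain malloc/new memory
        case cudaMemoryTypeHost:         return "registered host";
        case cudaMemoryTypeDevice:       return "device";
        case cudaMemoryTypeManaged:      return "managed";
        default:                         return "unknown";
    }
}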

Here’s how I am building:


/test/nvhpc_2022_2211_Linux_x86_64_cuda_11.8/install_location/Linux_x86_64/22.11/compilers/bin/nvc++ -stdpar=gpu -c test_execute.cpp

/test/nvhpc_2022_2211_Linux_x86_64_cuda_11.8/install_location/Linux_x86_64/22.11/compilers/bin/nvc++ -stdpar=gpu test_execute.o test_standalone.cpp -o test_standalone.exe -L/test/nvhpc_2022_2211_Linux_x86_64_cuda_11.8/install_location/Linux_x86_64/22.11/compilers/lib -L/test/nvhpc_2022_2211_Linux_x86_64_cuda_11.8/install_location/Linux_x86_64/22.11/cuda/11.8/lib64 -lcudanvhpc -lcudart -lnvc -lnvhpcatm -lpgm -Wl,-rpath=/test/nvhpc_2022_2211_Linux_x86_64_cuda_11.8/install_location/Linux_x86_64/22.11/compilers/lib -Wl,-rpath=/test/nvhpc_2022_2211_Linux_x86_64_cuda_11.8/install_location/Linux_x86_64/22.11/cuda/11.8/lib64

When I run ./test_standalone.exe, I get a memory access error. I suspected this was a memory-type issue, since the CPU path (“-stdpar=multicore”) works, so I added the CUDA code above to query the pointer attributes and confirmed that the memory being passed into test_execute() is “host unregistered” memory. Next, I copy/pasted the test_execute() function definition into test_standalone.cpp, compiled it all as one file, and everything ran great (the same attribute check now reported managed memory). I have two questions:

  1. What is the correct way to separately compile modules with stdpar=gpu functionality from nvc++? I know separate compilation and linking from nvcc, but I don’t think that should be necessary here as device code should be generated and able to be in-lined all within test_execute().

  2. If the separate compilation described in #1 is possible, am I able to link with a host compiler like g++ (ultimately to use with mex)? I tried switching to g++ in my linking step above and was able to compile and link, but when I ran ./test_standalone.exe I got “No CUDA device code available”. Being restricted to nvc++ alone is incredibly limiting, so I’m hoping there’s a way to achieve this.

Thank you!

NV HPC SDK: nvhpc_2022_2211_Linux_x86_64_cuda_11.8
RHEL 7

Hi mfeuling01,

The problem here is that you’re explicitly adding the “-lnvc” library on the command line. nvc++ will implicitly add the nvc runtime library so no need for you to add it. But when you do, you’re putting it out of order in where it needs to be on the link line.

No need to put any of the compiler runtime libs on the link line and we implicitly set the rpath, so no need for that either.

% nvc++ -stdpar=gpu test_execute.o test_standalone.cpp -o test_standalone.exe -V22.11
test_standalone.cpp:
% ./test_standalone.exe 1                                                                                                       
Coeff: 1
array_in memory type: 3
  1. What is the correct way to separately compile modules with stdpar=gpu functionality from nvc++? I know separate compilation and linking from nvcc, but I don’t think that should be necessary here as device code should be generated and able to be in-lined all within test_execute().

Here you only need host linking, so this isn’t an issue. For device linking, we enable RDC by default and nvc++ will invoke the device linker as part of the link step.

  2. If the separate compilation described in #1 is possible, am I able to link with a host compiler like g++ (ultimately to use with mex)? I tried switching to g++ in my linking step above and was able to compile and link, but when I ran ./test_standalone.exe I got “No CUDA device code available”. Being restricted to nvc++ alone is incredibly limiting, so I’m hoping there’s a way to achieve this.

If you’re creating a shared object or linking using a different compiler like g++, then you may need to add “-gpu=nordc”. This removes the need for device linking.

The caveat is that without RDC, calls from device code (i.e. a function called from within the body of the transform) can only be made to other device functions defined in the same source file (so they can be inlined); see the sketch below.
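
To make that concrete, here’s a minimal sketch (add_coeff and scale are made-up names, not from your code):

// nordc sketch: with -gpu=nordc, any function called inside the parallel
// algorithm's lambda must be defined in this same source file so it can be inlined.
#include <algorithm>
#include <cstddef>
#include <execution>

inline float add_coeff(float a, float coeff) { return a + coeff; } // same translation unit: OK

void scale(float *in, float *out, size_t n, float coeff)
{
    // A call here to a function defined in another .cpp would require RDC so the
    // device linker can resolve it.
    std::transform(std::execution::par_unseq, in, in + n, out,
                   [coeff](float a) { return add_coeff(a, coeff); });
}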

Hope this helps,
Mat


Hi Mat,

Thank you for the quick response.

The problem here is that you’re explicitly adding the “-lnvc” library on the command line. nvc++ will implicitly add the nvc runtime library so no need for you to add it. But when you do, you’re putting it out of order in where it needs to be on the link line.

No need to put any of the compiler runtime libs on the link line and we implicitly set the rpath, so no need for that either.

That makes sense. I probably had all the gratuitous linking and rpath stuff from when I was attempting the link step with g++ and just switched compilers. The good news is that I was able to successfully run with nvc++ on both compile and link steps when I removed all the manual linking of compiler libraries. Thanks!

The bad news is that I’ll need to be able to do the final link step with g++. As suspected, it’s not as easy as it is with the nvc++ link:

nvc++ -stdpar=gpu -gpu=nordc -c test_execute.cpp

g++ -std=c++17 test_execute.o test_standalone.cpp -o test_standalone.exe 

This produces a long list of undefined-reference errors, coming mostly (but possibly not exclusively) from:
__pgi_* functions
CUDA runtime functions
Mcuda_compiled

It’s clear I’ll need to link against the nvc++ compiler libraries and the CUDA runtime libraries when linking with g++. From what you implied in your solution to my first problem, the order of that linking matters. I could do the legwork of finding which of the SDK’s .so files provide the symbols g++ complains about, but is there anything I should know about how to order them?

Sorry, my earlier answer was incomplete with respect to your ultimate goal of getting this code into a shared object.

The core problem here is that CUDA Unified Memory (UM) is used by default so users don’t need to manage the device data themselves. Also, things like vectors are very difficult to manage manually but trivial with UM. To make this implicit, when the “-gpu=managed” flag is used (enabled by default with -stdpar=gpu) the compiler automatically replaces allocation calls to malloc and new with “cudaMallocManaged”. However, if you’re using g++ to compile the file that contains the allocation, this replacement isn’t done and you’ll be passing host addresses to the device, which causes the error.

To fix this, you’ll need to manage the device data manually. You can do that with calls to cudaMalloc or cudaMallocManaged, but for ease of use I’ll use OpenACC data directives. I put them in “test_execute” itself, which means the data will be transferred each time it’s called. You might need to add additional routines that can be called from main to create the device data and then pass the device data to the computational routines; a sketch of one such routine follows.
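
For example, one such routine could simply wrap cudaMallocManaged and be exported from the nvc++-built library so main can call it in place of malloc. This is only a sketch; alloc_managed_floats and free_managed_floats are made-up names:

// Hypothetical helpers, compiled with nvc++ alongside test_execute.cpp
#include <cstddef>
#include "cuda_runtime.h"

// Returns memory that is accessible from both the g++-compiled host code and
// the GPU kernels generated for the parallel algorithms; nullptr on failure.
float *alloc_managed_floats(size_t num_elem)
{
    float *ptr = nullptr;
    if (cudaMallocManaged(&ptr, num_elem * sizeof(float)) != cudaSuccess)
        return nullptr;
    return ptr;
}

void free_managed_floats(float *ptr)
{
    cudaFree(ptr);
}

With that approach the pointers handed to test_execute should already be device-accessible, so the data region below wouldn’t be needed for them. Here, though, I’ll stick with the data directives: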

Note that I’m using “-gpu=nomanaged” to disable UM and building a shared object so I don’t need to worry about adding the NV runtime libraries to the g++ link line. However, you’ll need to set LD_LIBRARY_PATH so the shared objects can be found by the loader.

test_execute.cpp:

#include <algorithm> // std::transform
#include <iostream>
#include <cassert>
#include <execution>

#include "cuda_runtime.h"
#include "openacc.h"

#define assertm(exp, msg) assert(((void)msg, exp))

void test_execute(float *array_in, float *array_out, const size_t vec_size, const float coeff)
{

#pragma acc data copyin(array_in[:vec_size]) copyout(array_out[:vec_size])
{

    float * d_array_in = (float*) acc_deviceptr(array_in);
    float * d_array_out = (float*) acc_deviceptr(array_out);
    cudaPointerAttributes attrib_in;
    cudaError_t cerr = cudaPointerGetAttributes(&attrib_in, d_array_in);
    assertm(cerr == cudaSuccess, "cudaPointerGetAttributes() failed!");

    std::cout << "array_in memory type: " << attrib_in.type << std::endl;

    std::transform
    (
        std::execution::par_unseq,
        d_array_in,
        d_array_in + vec_size,
        d_array_out,
        [coeff](float a)
        {
            float ret = a + coeff;
            return ret;
        }
    );

}

} // end processing code test_execute.cpp

Building the code:

luna:/local/home/mcolgrove/uf1% setenv LD_LIBRARY_PATH /proj/nv/Linux_x86_64/22.11/compilers/lib/:/local/home/mcolgrove/uf1
luna:/local/home/mcolgrove/uf1% nvc++ -o libexecute.so test_execute.cpp -shared -fPIC -stdpar=gpu -gpu=nomanaged,nordc -acc
luna:/local/home/mcolgrove/uf1% g++ test_standalone.cpp -L./ -lexecute
luna:/local/home/mcolgrove/uf1% ./a.out 1
Coeff: 1
array_in memory type: 2

-Mat
