Hello,
My objective is to see whether I can use std::shared_ptr in C++ to pass resources to compute regions on the GPU using OpenACC.
The code snippet is as follows:
#include <cstdio>
#include <iostream>
#include <memory>
#include <vector>
#include <openacc.h>

int main() {
    // Create a shared pointer to an STL vector of ints
    std::shared_ptr<std::vector<int>> vec_ptr = std::make_shared<std::vector<int>>();

    // Fill the vector with some data
    for (int i = 0; i < 10; ++i) {
        vec_ptr->push_back(i);
    }

#pragma acc enter data copyin(vec_ptr[0:1], vec_ptr[0:9])

    // Parallelize a loop over the vector using OpenACC
#pragma acc data copyout(vec_ptr[0:9])
    {
#pragma acc parallel loop
        for (int i = 0; i < vec_ptr->size(); ++i) {
            printf(" The host, device flags are %d, %d \n",
                   acc_on_device(acc_device_host), acc_on_device(acc_device_nvidia));
            // Access and modify vector elements in parallel
            (*vec_ptr)[i] *= 2;
        }
    }

#pragma acc update self(vec_ptr[0:9])

    // Print the modified vector from the host
    for (int i = 0; i < vec_ptr->size(); ++i) {
        std::cout << (*vec_ptr)[i] << " ";
    }
    std::cout << std::endl;

#pragma acc exit data delete(vec_ptr[0:999], vec_ptr[0:1])

    return 0;
}
I am using nvhpc 23.2 with cuda toolkit 12.0 and cmake/3.23.0. My device details are the following:
[simon@krakengpu1 build]$ nvidia-smi
Fri May 3 16:24:17 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE…    On   | 00000000:2F:00.0 Off |                    0 |
| N/A   47C    P0   135W / 250W |   2900MiB / 16384MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     86839      C   …s/phydll_train/bin/python        688MiB |
|    0   N/A  N/A     90677      C   python                           2204MiB |
+-----------------------------------------------------------------------------+
My build looks as follows:
[simon@krakengpu1 build]$ make
Consolidate compiler generated dependencies of target testSmartPtrs
[ 50%] Building CXX object CMakeFiles/testSmartPtrs.dir/src/main.cpp.o
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
main:
18, Generating enter data copyin(vec_ptr)
Generating copyout(vec_ptr) [if not already present]
Generating NVIDIA GPU code
20, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
26, Generating update self(vec_ptr)
33, Generating exit data delete(vec_ptr)
std::__shared_ptr<std::vector<int, std::allocator<int>>, (__gnu_cxx::_Lock_policy)2>::operator*() const:
2, include "memory"
10, include "memory"
82, include "shared_ptr.h"
52, include "shared_ptr_base.h"
906, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
std::__shared_ptr<std::vector<int, std::allocator<int>>, (__gnu_cxx::_Lock_policy)2>::operator->() const:
2, include "memory"
10, include "memory"
82, include "shared_ptr.h"
52, include "shared_ptr_base.h"
913, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
std::vector<int, std::allocator<int>>::size() const:
3, include "vector"
64, include "stl_vector.h"
646, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
std::vector<int, std::allocator<int>>::operator[](unsigned long):
3, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
[100%] Linking CXX executable ../app/testSmartPtrs
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
[100%] Built target testSmartPtrs
Expected result vs obtained result
I would expect printf() to print 0,1 ten times (i.e. acc_on_device(acc_device_host) == 0 and acc_on_device(acc_device_nvidia) == 1), followed by the modified array {0 2 4 6 8 10 12 14 16 18}. What I get instead is that printf() prints 1,0 ten times, followed by the (correctly) modified array {0 2 4 6 8 10 12 14 16 18}. This means the computation happened only on the host and not on the device.
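For comparison, the raw-pointer variant of the same loop below is the kind of code I would expect to offload; this is only an illustrative sketch (the names data and n are introduced here, and I have not included its output):

// Sketch: the same loop, but exposing the vector's raw storage to OpenACC.
int *data = vec_ptr->data();
int  n    = static_cast<int>(vec_ptr->size());

#pragma acc parallel loop copy(data[0:n])
for (int i = 0; i < n; ++i) {
    data[i] *= 2;
}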
testSmartPtr.zip (2.8 KB)
I am attaching an MWE along with its CMake file to this post. Any help or direction on why the code does not execute on the GPU is highly appreciated.
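In case the attachment is hard to open, a minimal CMake configuration of the kind I am using would look roughly like this (illustrative only; the target name, C++ standard, and flags are assumptions here, the exact file is in the zip):

cmake_minimum_required(VERSION 3.23)
project(testSmartPtrs CXX)

add_executable(testSmartPtrs src/main.cpp)
set_target_properties(testSmartPtrs PROPERTIES CXX_STANDARD 17)

# Enable OpenACC GPU offload and compiler feedback with nvc++ (flags are illustrative).
target_compile_options(testSmartPtrs PRIVATE -acc=gpu -Minfo=accel)
target_link_options(testSmartPtrs PRIVATE -acc=gpu)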
Thanks a ton,
Sangeeth