C++ Smart Pointers and OpenACC

Hello,
My objective is to see whether I can use std::shared_ptr in C++ to pass resources to compute regions on the GPU using OpenACC.

The code snippet is as follows:

#include <iostream>
#include <memory>
#include <vector>
#include <cstdio>   // for printf
#include <openacc.h>
int main(){
    // Create a shared pointer to an STL vector of ints
    std::shared_ptr<std::vector<int>> vec_ptr = std::make_shared<std::vector<int>>();

    // Fill the vector with some data
    for (int i = 0; i < 10; ++i) {
        vec_ptr->push_back(i);
    }
    #pragma acc enter data copyin(vec_ptr[0:1], vec_ptr[0:9])

    // Parallelize a loop over the vector using OpenACC
    #pragma acc data copyout(vec_ptr[0:9])
    {
      #pragma acc parallel loop
      for (int i = 0; i < vec_ptr->size(); ++i) {
        printf(" The host, device flags are %d, %d \n", acc_on_device(acc_device_host), acc_on_device(acc_device_nvidia));
        // Access and modify vector elements safely in parallel
        (*vec_ptr)[i] *= 2;
      }
    }

    #pragma acc update self (vec_ptr[0:9])
    //Print the modified vector from host
    for (int i = 0; i < vec_ptr->size(); ++i)  {
        std::cout << (*vec_ptr)[i] << " ";
    }
    std::cout << std::endl;

    #pragma acc exit data delete (vec_ptr[0:999], vec_ptr[0:1])
    return 0;
}

I am using NVHPC 23.2 with CUDA Toolkit 12.0 and cmake/3.23.0. My device details are the following:
[simon@krakengpu1 build]$ nvidia-smi
Fri May 3 16:24:17 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:2F:00.0 Off |                    0 |
| N/A   47C    P0   135W / 250W |   2900MiB / 16384MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     86839      C   ...s/phydll_train/bin/python      688MiB |
|    0   N/A  N/A     90677      C   python                           2204MiB |
+-----------------------------------------------------------------------------+

My build looks as follows:

[simon@krakengpu1 build]$ make
Consolidate compiler generated dependencies of target testSmartPtrs
[ 50%] Building CXX object CMakeFiles/testSmartPtrs.dir/src/main.cpp.o
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
main:
     18, Generating enter data copyin(vec_ptr)
         Generating copyout(vec_ptr) [if not already present]
         Generating NVIDIA GPU code
         20, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     26, Generating update self(vec_ptr)
     33, Generating exit data delete(vec_ptr)
std::__shared_ptr<std::vector<int, std::allocator<int>>, (__gnu_cxx::_Lock_policy)2>::operator*() const:
      2, include "memory"
          10, include "memory"
               82, include "shared_ptr.h"
                    52, include "shared_ptr_base.h"
                        906, Generating implicit acc routine seq
                             Generating acc routine seq
                             Generating NVIDIA GPU code
std::__shared_ptr<std::vector<int, std::allocator<int>>, (__gnu_cxx::_Lock_policy)2>::operator->() const:
      2, include "memory"
          10, include "memory"
               82, include "shared_ptr.h"
                    52, include "shared_ptr_base.h"
                        913, Generating implicit acc routine seq
                             Generating acc routine seq
                             Generating NVIDIA GPU code
std::vector<int, std::allocator<int>>::size() const:
      3, include "vector"
          64, include "stl_vector.h"
              646, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
std::vector<int, std::allocator<int>>::operator[](unsigned long):
      3, include "vector"
          64, include "stl_vector.h"
              771, Generating implicit acc routine seq
                   Generating acc routine seq
                   Generating NVIDIA GPU code
[100%] Linking CXX executable ../app/testSmartPtrs
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
[100%] Built target testSmartPtrs

Expected result vs. obtained result
I would expect printf() to print "0, 1" ten times, followed by the modified array {0 2 4 6 8 10 12 14 16 18}. Instead, printf() prints "1, 0" ten times, followed by the modified array {0 2 4 6 8 10 12 14 16 18}. This means the computation happened only on the host and not on the device.
testSmartPtr.zip (2.8 KB)

I am attaching an MWE along with its CMake file. Any help or direction on why the code does not execute on the GPU is highly appreciated.

Thanks a ton,
Sangeeth

Hi Sangeeth,

The program seems to work correctly for me, so I suspect something else is going on.

testSmartPtr/build% ../app/testSmartPtrs
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
0 2 4 6 8 10 12 14 16 18

Note that there’s no NVHPC 23.2 release, so I’m assuming you’re using 23.1 or 23.3 (I get the same output for both). Also, your cmake wants to use the g++ flag “-fopenacc”, so I needed to override this flag via:

cmake -DOpenACC_CXX_FLAGS=-acc ../

Though since you’re not seeing the “unknown” flag error, I’m assuming you’re doing something similar.

We can test if the device is being used by setting the environment variable “NV_ACC_NOTIFY=1”. You should see something like the following. Note that “1” tells the runtime to print a message each time a kernel is launched.

% app/testSmartPtrs
launch CUDA kernel  file=/home/mcolgrove/tmp/testSmartPtr/src/main.cpp function=main line=18 device=0 threadid=1 num_gangs=1280 num_workers=1 vector_length=128 grid=1280 block=128
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
 The host, device flags are 0, 1
0 2 4 6 8 10 12 14 16 18

You should also check if the OpenACC runtime can find your CUDA driver by running the “nvaccelinfo” utility.
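
If it helps, here's a minimal sketch (just for illustration, it's not part of your MWE) of the same check done programmatically, by asking the OpenACC runtime how many NVIDIA devices it can see:

// Sketch: programmatic version of the nvaccelinfo check.
// acc_get_num_devices() returning 0 usually means the runtime could not
// find a usable CUDA driver, and OpenACC regions will fall back to the host.
#include <cstdio>
#include <openacc.h>

int main() {
    int ndev = acc_get_num_devices(acc_device_nvidia);
    std::printf("OpenACC runtime sees %d NVIDIA device(s)\n", ndev);
    return 0;
}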

Finally, let’s try building the binary on the command line rather than using cmake.

% nvc++ -acc -gpu=managed src/main.cpp; ./a.out

If that works, then rebuild with cmake but add “make VERBOSE=1” so we can see how it’s building the binary. It might be adding something that’s causing the issue.
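
As an aside, if you'd prefer not to rely on "-gpu=managed", a common alternative is to move only the vector's underlying storage through a raw pointer, so no shared_ptr or vector internals need to be dereferenced on the device. A minimal sketch (untested against your cmake setup, and reusing the same 10-element vector as your example):

#include <cstdio>
#include <memory>
#include <vector>

int main() {
    // Same setup as the original example: a shared_ptr to a vector of ints.
    auto vec_ptr = std::make_shared<std::vector<int>>(10);
    for (int i = 0; i < 10; ++i) (*vec_ptr)[i] = i;

    // Expose the contiguous storage as a raw pointer plus a length.
    int *raw = vec_ptr->data();
    int  n   = static_cast<int>(vec_ptr->size());

    #pragma acc enter data copyin(raw[0:n])      // deep-copy the elements to the device
    #pragma acc parallel loop present(raw[0:n])
    for (int i = 0; i < n; ++i)
        raw[i] *= 2;
    #pragma acc exit data copyout(raw[0:n])      // copy results back, free the device copy

    for (int i = 0; i < n; ++i) std::printf("%d ", raw[i]);
    std::printf("\n");
    return 0;
}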

-Mat