Dear NVIDIA team,
using the C++ parallel algorithms (stdpar) in NVHPC SDK 25.1 (and previous versions), an additional host-to-device upload is triggered when a kernel accesses a pointer that is a class member.
However, no upload is triggered when this pointer is aliased by a local variable first.
Interestingly, the observed HtoD transfer cannot correspond to the actual pointed-to memory, as that would require a larger transfer, as can be seen in an 'init run' we perform to trigger the initial upload.
Is there a way to get this working?
Our multi-physics code relies on such accessor structures, and explicitly creating a local pointer alias is not an option everywhere.
Best regards
Miro
A minimal working example:
#include <algorithm>
#include <execution>
#include <vector>
#include <nvtx3/nvToolsExt.h>
class DummyClass {
public:
  DummyClass(const int size) : m_index(size), p_vec(new double[size]) {
    for (int i = 0; i < size; i++) m_index[i] = i;
  }

  void dummyParallelFor();
  void dummyParallelForAliased();

private: // same problem for public, too.
  double* const p_vec;
  std::vector<int> m_index; // dummy index
};
inline void DummyClass::dummyParallelFor() {
  std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
                [=](const int& i) {
                  // dummy read and write pattern
                  if (p_vec[i] < 0.0) {
                    p_vec[i] = 2.0;
                  }
                });
}
inline void DummyClass::dummyParallelForAliased() {
  double* p_p_vec = p_vec; // the 'aliasing'
  std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
                [=](const int& i) {
                  if (p_p_vec[i] < 0.0) {
                    p_p_vec[i] = 2.0;
                  }
                });
}
int main() {
  constexpr int size = 40000; // any 'large' number
  DummyClass dc(size);

  // To trigger the sync HtoD
  nvtxRangePushA("init run : non-aliased");
  dc.dummyParallelFor();
  nvtxRangePop();
  nvtxRangePushA("init run : aliased");
  dc.dummyParallelForAliased();
  nvtxRangePop();

  // Relevant profiling
  nvtxRangePushA("aliased");
  dc.dummyParallelForAliased();
  nvtxRangePop();
  nvtxRangePushA("non-aliased");
  dc.dummyParallelFor();
  nvtxRangePop();
}
The compile instructions, used on an Intel Xeon W-2155 with an NVIDIA Quadro RTX 6000:
CXX="/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/bin/nvc++"
CXXFLAGS=" -O2 -std=c++17 -stdpar=gpu -gpu=cc75,mem:managed,cuda12.6 "
$CXX $CXXFLAGS main.cpp -o main
Profiling call
nsys profile --trace=cuda,nvtx,mpi --force-overwrite=true --output=report.nsys-rep ./main
The outcome
report.nsys-rep (299.2 KB)
Hi Miro,
Perhaps I'm not quite understanding the question, but given that p_vec is already on the device and p_p_vec is aliased to p_vec, there's no need for the unified memory to be copied again, which is why no transfer shows up.
The actual data movement occurs during the first kernel call, which is why the first kernel is significantly longer than the others.
-Mat
Hi Mat,
we have here four kernel calls: two init runs, one 'aliased' and one 'non-aliased', followed by the two runs of interest.
The init runs do not matter - the upload in the first one corresponds to the necessary initial data transfer.
However, the 'aliased' (p_p_vec) run performs as desired (you barely see it in the profiling), while the 'non-aliased' run is very poor, taking roughly half as long as the init run.
In this run an HtoD transfer also appears, which moves less data than in the init run.
What data is transferred here? Why? And how can it be prevented?
As you said, this should not be needed.
Best regards
Miro
Ok, so the question is more about why the second, non-aliased routine incurs additional managed-memory data movement. Unfortunately, I'm not sure. I'm not able to recreate this and, while the non-aliased kernel is slightly longer than the aliased ones, I'm not seeing any additional data movement. The second kernel only takes 4,352 ns vs 284,800 ns for the first.
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
98.6 289,152 2 144,576.0 144,576.0 4,352 284,800 198,306.7 void cub::CUB_300000_SM_90_NVHPC::detail::for_each::static_kernel<cub::CUB_300000_SM_90_NVHPC::deta…
1.4 4,096 2 2,048.0 2,048.0 1,984 2,112 90.5 void cub::CUB_300000_SM_90_NVHPC::detail::for_each::static_kernel<cub::CUB_300000_SM_90_NVHPC::deta…
It could be a driver problem since the CUDA driver manages the data and somehow sees the data as dirty. Or it could be an artifact of using a consumer device (I’m using an H100).
What driver version are you using? I don’t have access to an RTX 6000, but should be able to find a system with the same driver to see if my results change.
On Grace Hopper the architecture is completely different, though, and thus not comparable, is it?
According to nvidia-smi, we have systems with two driver versions,
Driver Version: 570.86.15 CUDA Version: 12.8
and
Driver Version: 565.57.01 CUDA Version: 12.7,
both yielding the same problem.
Additionally, I tested the problem on another machine (JUWELS Booster) equipped with an A100, thus Ampere and cc80 in the above CXXFLAGS. The driver version there is Driver Version: 560.35.03, CUDA Version: 12.6.
The outcome is equivalent:
juwels_report.nsys-rep (409.3 KB)
Adding
nvtxRangePushA("non-aliased 100x");
for (int i = 0; i < 100; i++) {
  dc.dummyParallelFor();
}
nvtxRangePop();
gives the following behavior:
Here, the marked region (in green) shows that after the kernel has been called a number of times, no upload is necessary/triggered anymore and the kernel performs as quickly as in the previous 'aliased' version.
Why is the behavior different? Is it possible to avoid this delayed behavior?
I was able to reproduce the issue once I moved to a system with managed memory, as opposed to the full unified memory system I was on before.
This made me think that it might have something to do with the static class object (dc), which causes the driver to see the class memory as "dirty" and hence move it each time. While this may be related, when I change the program to dynamically allocate "dc" so it gets placed in managed memory, I still see the behavior.
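For reference, a minimal sketch of that variant of the main routine, assuming the DummyClass reproducer above and that, under -stdpar=gpu -gpu=mem:managed, heap allocations are serviced by the managed allocator (which is what places the dynamically allocated object in managed memory):

#include <memory>

int main() {
  constexpr int size = 40000;
  // Heap-allocate the object; with managed-memory mode the DummyClass
  // instance itself (and hence what 'this' points to) lives in CUDA
  // managed memory instead of on the host stack.
  auto dc = std::make_unique<DummyClass>(size);

  dc->dummyParallelFor();        // init run : non-aliased
  dc->dummyParallelForAliased(); // init run : aliased
  dc->dummyParallelForAliased(); // aliased
  dc->dummyParallelFor();        // non-aliased
}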
I’m going to ask some folks in our STDPAR compiler team to take a look to see if they have ideas and will get back to you.
I was on the right track. It has to do with the implicit "this" pointer needing to be copied in. Here's her explanation:
I can see the first for_each uses a class member inside the lambda:
std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
              [=](const int& i) {
                // dummy read and write pattern
                if (p_vec[i] < 0.0) {
                  p_vec[i] = 2.0;
                }
              });
This is equivalent to:
std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
              [=](const int& i) {
                // dummy read and write pattern
                if (this->p_vec[i] < 0.0) {
                  this->p_vec[i] = 2.0;
                }
              });
As you can see, there is an extra implicit pointer indirection, this->p_vec.
So what ends up being captured in that lambda is the object pointer this, and this pointer is sent to GPU memory with cudaMalloc/cudaMemcpy because the lambda capture becomes the kernel parameter. This is, I think, where the H2D transfer originates. When p_vec is actually accessed in the lambda, its address relative to this is computed and the member is loaded from unified memory.
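Roughly speaking, the closure the compiler generates for that lambda behaves like the following hand-written sketch (names are illustrative and member access control is ignored):

// Illustrative stand-in for the compiler-generated closure: capturing
// p_vec with [=] inside a member function really captures 'this'.
struct NonAliasedBody {
  DummyClass* self; // this pointer becomes the kernel parameter

  void operator()(const int& i) const {
    // Each access first reads the DummyClass object through 'self'
    // (i.e. through unified/managed memory) to fetch the p_vec member.
    if (self->p_vec[i] < 0.0) {
      self->p_vec[i] = 2.0;
    }
  }
};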
Perhaps then the CUDA driver eventually decides to migrate the memory behind this to GPU memory to avoid repeated remote reads from CPU memory. To better understand what the unified memory management module is doing, I would suggest adding the
--cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true --event-sample=system-wide --cpu-socket-events=61,71,265,273 --cpu-socket-metrics=103,104
flags to nsys to see more unified memory profiling information, including page migrations, etc.
In the second for_each, what ends up being captured is p_p_vec, which holds the same address as p_vec, so that address is sent to GPU memory in the kernel argument via the H2D transfer. After this, the pointer is simply used in the lambda on the GPU without the need for any extra data transfers.
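A corresponding sketch of the closure for the aliased variant, where only the pointer value is captured:

// Illustrative stand-in for the closure of the aliased lambda: the kernel
// parameter already holds the array address, so no access goes via 'this'.
struct AliasedBody {
  double* p_p_vec; // same value as p_vec

  void operator()(const int& i) const {
    if (p_p_vec[i] < 0.0) {
      p_p_vec[i] = 2.0;
    }
  }
};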
Thank you for 'this'. Explicitly capturing this by value, i.e. using [=, *this], solved the problem.
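For reference, a minimal sketch of how that change could look in the reproducer's non-aliased member function:

inline void DummyClass::dummyParallelFor() {
  // [=, *this] (C++17) copies the whole object into the closure, so the
  // captured copy of p_vec is used directly and no 'this' indirection
  // (and no extra HtoD of the object) is needed on the device.
  std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
                [=, *this](const int& i) {
                  if (p_vec[i] < 0.0) {
                    p_vec[i] = 2.0;
                  }
                });
}

Note that capturing *this copies every member, including m_index, on each call; for classes with large members a small accessor struct or a local alias may still be the cheaper option.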
Results on our system now look as follows: