Dear NVIDIA team,
using the C++ parallel algorithms (stdpar) in NVHPC SDK 25.1 (and previous versions), an additional host-to-device upload is triggered when a kernel accesses a pointer that is a class member.
However, no upload is triggered when this pointer is aliased by a local variable first.
Interestingly, the observed HtoD transfer cannot correspond to the actual pointed-to memory, as that would require a larger transfer, as can be seen in an 'init run' we perform to trigger the initial upload.
Is there a way to get this working?
Our multi-physics code relies on such accessor structures, and explicitly creating a local pointer alias is not an option everywhere.
Best regards
Miro
A minimal working example:
#include <algorithm>
#include <execution>
#include <vector>
#include <nvtx3/nvToolsExt.h>
class DummyClass {
public:
  DummyClass(const int size) : m_index(size), p_vec(new double[size]) {
    for (int i = 0; i < size; i++) m_index[i] = i;
  }

  void dummyParallelFor();
  void dummyParallelForAliased();

private: // same problem for public, too.
  double* const p_vec;
  std::vector<int> m_index; // dummy index
};
inline void DummyClass::dummyParallelFor() {
  std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
                [=](const int& i) {
                  // dummy read and write pattern
                  if (p_vec[i] < 0.0) {
                    p_vec[i] = 2.0;
                  }
                });
}
inline void DummyClass::dummyParallelForAliased() {
  double* p_p_vec = p_vec; // the 'aliasing'
  std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
                [=](const int& i) {
                  if (p_p_vec[i] < 0.0) {
                    p_p_vec[i] = 2.0;
                  }
                });
}
int main() {
  constexpr int size = 40000; // any 'large' number
  DummyClass dc(size);

  // To trigger the sync HtoD
  nvtxRangePushA("init run : non-aliased");
  dc.dummyParallelFor();
  nvtxRangePop();
  nvtxRangePushA("init run : aliased");
  dc.dummyParallelForAliased();
  nvtxRangePop();

  // Relevant profiling
  nvtxRangePushA("aliased");
  dc.dummyParallelForAliased();
  nvtxRangePop();
  nvtxRangePushA("non-aliased");
  dc.dummyParallelFor();
  nvtxRangePop();
}
The compile instructions, used on an Intel Xeon W-2155 with an NVIDIA Quadro RTX 6000:
CXX="/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/bin/nvc++"
CXXFLAGS=" -O2 -std=c++17 -stdpar=gpu -gpu=cc75,mem:managed,cuda12.6 "
$CXX $CXXFLAGS main.cpp -o main
Profiling call
nsys profile --trace=cuda,nvtx,mpi --force-overwrite=true --output=report.nsys-rep ./main
The outcome
report.nsys-rep (299.2 KB)
Hi Miro,
Perhaps I'm not quite understanding the question, but given that p_vec is already on the device and p_p_vec is aliased to p_vec, there's no need for the unified memory to be copied again, which is why no transfer shows up.
The actual data movement occurs during the first kernel call, which is why the first kernel is significantly longer than the others.
-Mat
Hi Mat,
we have here four kernel calls: two init runs, one 'aliased' and one 'non-aliased', followed by the two runs of interest.
The init runs do not matter - the upload in the first one corresponds to the necessary initial data transfer.
However, the 'aliased' (p_p_vec) run performs as desired (you barely see it in the profiling), while the 'non-aliased' run is very poor, taking roughly half as long as the init run.
In this run an HtoD transfer also appears, which moves less data than in the init run.
What data is transferred here? Why? And how can it be prevented?
As you said, this should not be needed.
Best regards
Miro
Ok, so the question is more about why the second, non-aliased routine incurs additional managed-memory data movement. Unfortunately, I'm not sure. I'm not able to recreate this and, while the non-aliased kernel is slightly longer than the aliased ones, I'm not seeing any additional data movement. The second kernel only takes 4,352 ns vs 284,800 ns for the first.
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
98.6 289,152 2 144,576.0 144,576.0 4,352 284,800 198,306.7 void cub::CUB_300000_SM_90_NVHPC::detail::for_each::static_kernel<cub::CUB_300000_SM_90_NVHPC::deta…
1.4 4,096 2 2,048.0 2,048.0 1,984 2,112 90.5 void cub::CUB_300000_SM_90_NVHPC::detail::for_each::static_kernel<cub::CUB_300000_SM_90_NVHPC::deta…
It could be a driver problem since the CUDA driver manages the data and somehow sees the data as dirty. Or it could be an artifact of using a consumer device (I’m using an H100).
What driver version are you using? I don’t have access to an RTX 6000, but should be able to find a system with the same driver to see if my results change.
On Grace Hopper the architecture is completely different, though, and thus not comparable, is it?
According to nvidia-smi, we have systems with two driver versions,
Driver Version: 570.86.15 CUDA Version: 12.8
and
Driver Version: 565.57.01 CUDA Version: 12.7,
both yielding the same problem.
Additionally, I tested the problem on another machine (JUWELS Booster) equipped with an A100, thus Ampere and cc80 in the above CXXFLAGS. The driver version there is Driver Version: 560.35.03, CUDA Version: 12.6.
The outcome is equivalent:
juwels_report.nsys-rep (409.3 KB)
Adding
nvtxRangePushA("non-aliased 100x");
for (int i = 0; i < 100; i++) {
  dc.dummyParallelFor();
}
nvtxRangePop();
gives the following behavior:
Here, the marked region (in green) shows that after the kernel has been called a number of times, no upload is necessary/triggered anymore and the kernel performs as quickly as in the previous 'aliased' version.
Why is the behavior different? Is it possible to avoid this delayed behavior?
I was able to reproduce the issue once I moved to a system with managed memory, as opposed to the full unified memory system I was on before.
This made me think that it might have something to do with the static class object (dc), which causes the driver to see the class memory as "dirty" and hence move it each time. While this may be related, when I change the program to dynamically allocate "dc" so it gets placed in managed memory, I still see the behavior.
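For reference, a minimal sketch of that variant of the main routine, assuming the DummyClass reproducer above and that, under -stdpar=gpu -gpu=mem:managed, heap allocations are serviced by the managed allocator (which is what places the dynamically allocated object in managed memory):

#include <memory>

int main() {
  constexpr int size = 40000;
  // Heap-allocate the object; with managed-memory mode the DummyClass
  // instance itself (and hence what 'this' points to) lives in CUDA
  // managed memory instead of on the host stack.
  auto dc = std::make_unique<DummyClass>(size);

  dc->dummyParallelFor();        // init run : non-aliased
  dc->dummyParallelForAliased(); // init run : aliased
  dc->dummyParallelForAliased(); // aliased
  dc->dummyParallelFor();        // non-aliased
}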
I’m going to ask some folks in our STDPAR compiler team to take a look to see if they have ideas and will get back to you.
I was on the right track. It has to do with the implicit "this" pointer needing to be copied in. Here's her explanation:
I can see the first for_each uses a class member inside the lambda:
std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
              [=](const int& i) {
                // dummy read and write pattern
                if (p_vec[i] < 0.0) {
                  p_vec[i] = 2.0;
                }
              });
This is equivalent to:
std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
              [=](const int& i) {
                // dummy read and write pattern
                if (this->p_vec[i] < 0.0) {
                  this->p_vec[i] = 2.0;
                }
              });
As you can see, there is an extra implicit pointer indirection, this->p_vec.
So what ends up being captured in that lambda is the object pointer this, and this pointer is sent to GPU memory with cudaMalloc/cudaMemcpy because the lambda capture becomes the kernel parameter. This is, I think, where the H2D transfer originates. When p_vec is actually accessed in the lambda, its address relative to this is computed and the member is loaded from unified memory.
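Roughly speaking, the closure the compiler generates for that lambda behaves like the following hand-written sketch (names are illustrative and member access control is ignored):

// Illustrative stand-in for the compiler-generated closure: capturing
// p_vec with [=] inside a member function really captures 'this'.
struct NonAliasedBody {
  DummyClass* self; // this pointer becomes the kernel parameter

  void operator()(const int& i) const {
    // Each access first reads the DummyClass object through 'self'
    // (i.e. through unified/managed memory) to fetch the p_vec member.
    if (self->p_vec[i] < 0.0) {
      self->p_vec[i] = 2.0;
    }
  }
};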
Perhaps then the CUDA driver eventually decides to migrate the memory behind this to GPU memory to avoid repeated remote reads from CPU memory. To better understand what the unified memory management module is doing, I would suggest adding the
--cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true --event-sample=system-wide --cpu-socket-events=61,71,265,273 --cpu-socket-metrics=103,104
flags to nsys to see more unified memory profiling information, including page migrations, etc.
In the second for_each, what ends up being captured is p_p_vec, which holds the same address as p_vec, so that address is sent to GPU memory in the kernel argument via the H2D transfer. After this, the pointer is simply used in the lambda on the GPU without the need for any extra data transfers.
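A corresponding sketch of the closure for the aliased variant, where only the pointer value is captured:

// Illustrative stand-in for the closure of the aliased lambda: the kernel
// parameter already holds the array address, so no access goes via 'this'.
struct AliasedBody {
  double* p_p_vec; // same value as p_vec

  void operator()(const int& i) const {
    if (p_p_vec[i] < 0.0) {
      p_p_vec[i] = 2.0;
    }
  }
};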
Thank you for 'this'. Explicitly capturing this by value, i.e. using [=, *this], solved the problem.
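For reference, a minimal sketch of how that change could look in the reproducer's non-aliased member function:

inline void DummyClass::dummyParallelFor() {
  // [=, *this] (C++17) copies the whole object into the closure, so the
  // captured copy of p_vec is used directly and no 'this' indirection
  // (and no extra HtoD of the object) is needed on the device.
  std::for_each(std::execution::par_unseq, m_index.begin(), m_index.end(),
                [=, *this](const int& i) {
                  if (p_vec[i] < 0.0) {
                    p_vec[i] = 2.0;
                  }
                });
}

Note that capturing *this copies every member, including m_index, on each call; for classes with large members a small accessor struct or a local alias may still be the cheaper option.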
Results on our system now look as follows: