I am trying to parallelize a simulation program that runs on HPC clusters. The goal is to be able to run it sequentially on a single core, in parallel on multiple CPU cores, or in parallel on a single Nvidia GPU. For these reasons, the code should ideally be written in “standard C++”.
I tried a “proof of concept” program with nvc++ -stdpar and with Thrust. The Thrust program seems to work well, but it needs the smart iterators (counting iterator, zip iterator) and the following algorithms:
- thrust::gather
- thrust::scatter
I cannot figure out any way to reproduce the gather and scatter algorithms using the standard parallel algorithms. The counting_iterator can be replaced by C++20 ranges::views::iota, and it seems that the zip iterator could be replaced by C++23 ranges::views::zip (although it does not seem to be implemented yet). Is there a way to write gather, scatter and zip operations in standard C++?
I wrote a minimal working example with Thrust that first gathers values from a value vector, compares them, and then scatters an output onto selected values. Keep in mind that the full program is more complex (an agent-based model with random interactions) and the vectors will have millions of elements.
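For the zip part of the question, when only two sequences are combined element by element, one possible stand-in is the binary overload of std::transform, which walks both input ranges together. The following is only a minimal sketch under that assumption; the helper name less_than_pairwise and the choice of int elements are hypothetical and not taken from the attached example.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <utility>
#include <vector>

// Hypothetical helper: result[i] is 1 if a[i] < b[i], else 0.
// The two inputs are traversed together, much as a zip of (a, b) would be.
template <class Policy>
std::vector<int> less_than_pairwise(Policy&& policy,
                                    const std::vector<int>& a,
                                    const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> result(a.size());
    std::transform(std::forward<Policy>(policy),
                   a.begin(), a.end(),  // first "zipped" sequence
                   b.begin(),           // second "zipped" sequence
                   result.begin(),
                   [](int x, int y) { return x < y ? 1 : 0; });
    return result;
}
```

The execution policy is a parameter, so the same function could be called with std::execution::seq for a sequential run or std::execution::par for a parallel one. Combining three or more sequences this way would need a different approach.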
minimal.cpp (3.8 KB)
Hi Maxime,
I’m not aware of a direct translation for gather/scatter into stdpar. You’ll most likely need to use a “for_each_n” construct, but it may not be as performant.
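As a rough illustration of what such a for_each_n construct might look like for a gather, here is a minimal sketch. It assumes int elements and a materialized vector of positions in place of a counting iterator; the name gather_for_each_n is hypothetical. Raw pointers are captured by value so that the lambda does not refer to stack-allocated objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: out[i] = values[map[i]] for every position i.
template <class Policy>
std::vector<int> gather_for_each_n(Policy&& policy,
                                   const std::vector<int>& map,
                                   const std::vector<int>& values) {
    std::vector<int> out(map.size());

    // Positions 0, 1, ..., n-1 stored explicitly (assumption: the extra
    // memory is acceptable; a counting iterator would avoid it).
    std::vector<std::size_t> positions(map.size());
    std::iota(positions.begin(), positions.end(), std::size_t{0});

    // Pointers to the heap-allocated element storage, captured by value.
    const int* p_map = map.data();
    const int* p_values = values.data();
    int* p_out = out.data();

    std::for_each_n(std::forward<Policy>(policy),
                    positions.begin(), positions.size(),
                    [=](std::size_t i) { p_out[i] = p_values[p_map[i]]; });
    return out;
}
```

Whether this performs as well as thrust::gather is not something the sketch can show; it would have to be measured.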
-Mat
Hi,
I figured out how to translate my minimal example to stdpar, using std::transform and std::for_each to replace the gather/scatter operations. However, I had a problem capturing the vectors from which the values are gathered and to which the values are scattered. I worked around it by creating shared pointers to vectors instead of plain vectors, so that the shared_ptr can be captured by copy in the lambda functions. Capturing the vectors themselves by reference gave a compilation warning and a run-time error, because the vector objects are stack-allocated (even though their elements are heap-allocated).
Is there a simpler way to capture a vector in the lambda function of a parallel algorithm?
minimal2.cpp (3.5 KB)
Oh,
I just realized that the data() member function of a std::vector returns a pointer to the beginning of the underlying C-style array, which is located in heap memory. I can therefore capture that pointer by value (it is just the array address) in the lambda function, and then use the bracket operator to gather/scatter elements.
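// 'input' holds the indices to gather; 'output' must have the same size as 'input'.
std::vector<int> values{1, 2, 3, 4, 5, 6, 7, 8};
std::transform(
    std::execution::par,
    input.begin(), input.end(),
    output.begin(),
    // Capture the raw pointer to the elements by value, not the vector by reference.
    [p_values = values.data()](int key) {
        return p_values[key];
    }
);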
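The same pointer-capture idea can be applied to the scatter side. The sketch below is a hypothetical conditional scatter in the spirit of thrust::scatter_if, not the code from the attachment: it assumes int elements, a materialized vector of positions, and distinct target indices among the selected entries (duplicate targets would be a data race under a parallel policy).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: for every i with stencil[i] != 0, dest[map[i]] = src[i].
template <class Policy>
void scatter_if_sketch(Policy&& policy,
                       const std::vector<int>& src,
                       const std::vector<int>& map,
                       const std::vector<int>& stencil,
                       std::vector<int>& dest) {
    assert(src.size() == map.size() && map.size() == stencil.size());

    // Positions 0, 1, ..., n-1 stored explicitly (assumption: stands in
    // for a counting iterator at the cost of extra memory).
    std::vector<std::size_t> positions(src.size());
    std::iota(positions.begin(), positions.end(), std::size_t{0});

    // Pointers to the element storage, captured by value in the lambda.
    const int* p_src = src.data();
    const int* p_map = map.data();
    const int* p_stencil = stencil.data();
    int* p_dest = dest.data();

    std::for_each(std::forward<Policy>(policy),
                  positions.begin(), positions.end(),
                  [=](std::size_t i) {
                      if (p_stencil[i] != 0) {
                          p_dest[p_map[i]] = p_src[i];
                      }
                  });
}
```

Elements of dest that are not targeted by a selected entry keep their previous values.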
I now need to test with a real program whether the performance of this technique is comparable to that of the Thrust gather/scatter algorithms.
minimal3.cpp (3.2 KB)