std::inclusive_scan crashes with a transform view under stdpar

Hi,

Please consider the following program:

// scan.cpp
#include <iostream>
#include <execution>
#include <numeric>
#include <ranges>
#include <vector>
#include <experimental/mdspan>
using std::experimental::mdspan;

int main()
{
	constexpr int N = 1'000'000;

	// in/out data
	std::vector<double> u(N);
	std::vector<double> v(N);

	// initialise input data
	std::iota(u.begin(), u.end(), 0);

	// working OK
	{
		std::transform_inclusive_scan(
			std::execution::par_unseq,
			u.begin(), u.end(),
			v.begin(),
			std::plus<double>{},
			[](double x) { return x - 1.0; }
		);

		std::cout << v.back() << std::endl;
	}

	// do the same with ranges (crashes)
	{
		// input (transform) range
		auto r = std::views::iota(0, N) | std::views::transform([mu = mdspan{u.data(), N}](int i) { return mu[i] - 1.0; });

		std::inclusive_scan(
			std::execution::par_unseq,
			r.begin(), r.end(),
			v.begin(),
			std::plus<double>{}
		);

		std::cout << v.back() << std::endl;
	}

	return 0;
}

I compile the program like this:

nvc++ -std=c++20 -stdpar -Minfo=stdpar scan.cpp -o scan

The second scan emulates the first by passing a transform range to inclusive_scan; it crashes (segmentation fault) when I compile it with nvc++ 23.9-0 (I’ve also tried older versions).

Initially, I thought the problem was the mdspan, as it’s a host memory view; however, the program still crashes when I replace std::vector with thrust::device_vector:

thrust::device_vector<double> u(N);
// ...
auto r = std::views::iota(0, N) | std::views::transform([mu = mdspan{thrust::raw_pointer_cast(u.data()), N}](int i) { return mu[i] - 1.0; });

std::inclusive_scan(/*as before*/);

I’ve seen the crash on a Linux HPC system with NVIDIA A100 and on WSL2 on a workstation with NVIDIA GeForce RTX 3080 Ti.

Please note the following:

  • The program also crashes when the output iterator is a transform view iterator.
  • transform_inclusive_scan also crashes when given transform view iterators.

Can you please tell me why that is? I thought C++ ranges can be freely used with stdpar algorithms.

Thanks,
Christos


Thanks Christos for reporting this. I have filed an internal problem report (35411) for this.


Hello,

Has this issue been resolved in the latest NVHPC SDK?

Regards,
Christos

Hi Christos,

I looked at the issue report and engineering did take a look, but it appears this is something that can’t be fixed, at least not by the compiler. Here’s what they wrote:

std::ranges::transform_view doesn’t work reliably in parallel algorithms. transform_view’s iterators contain a pointer to the transform_view object. The iterators are used in device code, but the transform_view object lives on the CPU stack. The illegal memory access is happening when the transform_view iterator dereferences the pointer to the transform_view so it can access the function object to be called. I think the problem only happens when the lambda passed to transform captures something. When the lambda doesn’t have any capture, the pointer from the iterator to the transform_view object is never actually dereferenced, so the illegal memory access doesn’t happen.

This is a limitation of stdpar that can’t easily be fixed. There are some types that don’t work in GPU code because of internal pointers like this; cartesian_product_view and zip_view have the same problem.

One workaround is to use CUDA Unified Memory with Heterogeneous Memory Management (HMM). This allows the GPU to access the CPU’s static memory, including the stack, so the transform_view object can be accessed.

Unified memory is the default on NVIDIA Grace Hopper systems using NVLink. On x86, managed memory is the default, so unified memory needs to be enabled via a flag. The flag name changed in 24.5 to “-gpu=mem:unified” from “-gpu=unified” in previous releases.
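For the scan.cpp example above, the invocations would look like this (a sketch using the flag names just mentioned):

```shell
# NVHPC 24.5 and newer
nvc++ -std=c++20 -stdpar -gpu=mem:unified scan.cpp -o scan

# NVHPC releases before 24.5
nvc++ -std=c++20 -stdpar -gpu=unified scan.cpp -o scan
```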

On x86 you will also need to enable HMM, which requires NVIDIA CUDA 12.2 with the open-source r535_00 driver or newer, an OS with a recent Linux kernel (6.1.24+, 6.2.11+, or 6.3+), and an NVIDIA Turing, Ampere, Ada Lovelace, Hopper, or newer GPU architecture.

Full details on HMM can be found at:

-Mat

Hi Mat,

Thank you for the detailed explanation; everything makes sense.

Regards,
Christos

@MatColgrove I just ran into the same issue with transform_view and a capture. But zip_view seems to work flawlessly. So under which conditions do zip_view and cartesian_product_view not work? Or am I maybe using a different version of libstdc++ (I’m using GNU 13.1 and compiling in C++23 mode)?

I personally don’t have experience with either of these, so I would need to ask engineering. Though they did say the problem only occurs when the lambda captures something, due to the object being on the stack. Are you capturing something and it still works?

Those two views don’t take custom functors/lambdas, so that remark was only about transform_view. They could still be implemented such that the iterators reference the view objects for data, but at least zip_view with two underlying ranges seems to work with my setup. I have not tested cartesian_product_view, but from Carthesian Product error with nvc++ I was under the impression that it would work as well (maybe HMM was involved there, though).

I talked with the engineer who wrote the above statement. He didn’t recall why he listed zip_view there, since its iterators don’t need internal pointers, but stated:

zip_view iterators can be problematic, but for a different reason. Dereferencing a zip_view iterator returns a proxy type, not the value_type. The proxy type behaves correctly in most cases, but for some algorithms (at least std::sort, I’m not sure which others) the proxy type leads to compilation errors.

For cartesian_product he said:

The cartesian_product implementation that comes with GCC has the internal pointer that triggers the illegal memory access failure. There are other cartesian_product implementations out there that don’t have internal pointers. I don’t know which implementation was being used in the forum issue linked from this one.

Let me know if you have additional questions.
