std::transform_inclusive_scan does not run in parallel with nvc++ -stdpar=multicore


I created this topic a while ago because nvc++ issued “parallelisation not supported” warnings for the overloads of std::transform_inclusive_scan that take an initial value.

I get no such warnings after switching to the overloads without an initial value.
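For anyone wanting to do the same switch, here is a rough sketch of how an initial value can be folded in after calling the overload without one. The helper names, the squaring transform and the use of long are made up for illustration and are not from my actual code. The fix-up step relies on the binary operation being associative (here integer addition); with floating point the rounding may differ slightly from the overload that takes an initial value.

```cpp
#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: running sum of squares, offset by `init`,
// using only the overload WITHOUT an initial value.
template <class Policy>
std::vector<long> scan_squares_no_init(Policy&& policy,
                                       const std::vector<long>& in,
                                       long init)
{
    std::vector<long> out(in.size());
    // Argument order: policy, first, last, d_first, binary_op, unary_op
    std::transform_inclusive_scan(policy, in.begin(), in.end(), out.begin(),
                                  std::plus<>{},
                                  [](long x) { return x * x; });
    // Fold the initial value into every prefix afterwards.
    std::transform(policy, out.begin(), out.end(), out.begin(),
                   [init](long v) { return init + v; });
    return out;
}

// Reference version using the overload WITH an initial value
// (the one that triggered the warnings for me).
template <class Policy>
std::vector<long> scan_squares_with_init(Policy&& policy,
                                         const std::vector<long>& in,
                                         long init)
{
    std::vector<long> out(in.size());
    // Argument order: policy, first, last, d_first, binary_op, unary_op, init
    std::transform_inclusive_scan(policy, in.begin(), in.end(), out.begin(),
                                  std::plus<>{},
                                  [](long x) { return x * x; },
                                  init);
    return out;
}
```

Both helpers should return the same vector for integer input, e.g. when called with std::execution::par and the same `in` and `init`. The extra std::transform pass costs one more sweep over the output, so this is a workaround rather than a free replacement.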

However, the algorithm still does not run in parallel with -stdpar=multicore. I’m using the latest NVIDIA HPC SDK (nvc++ 23.7.0, 64-bit target on x86-64 Linux, -tp haswell).

I have yet to do any serious profiling, but I wrote a small program that collects CPU ID information of every thread used by STL algorithms such as std::for_each, std::transform, and std::transform_inclusive_scan (no initial value argument).

I set the number of threads via OMP_NUM_THREADS, and see that std::for_each and std::transform run on multiple threads. However, this is not the case with std::transform_inclusive_scan, which always runs on a single thread.
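To give an idea of the kind of check I mean, here is a minimal sketch, not the actual test program. It records std::thread::id values instead of CPU IDs to stay portable, and the names ThreadRecorder, threads_used_by_for_each and threads_used_by_scan are made up for this post. The mutex makes the recording safe but slows the algorithm down, so this is only a diagnostic, not a benchmark. The lambdas capture by reference, which I would only expect to work on a CPU target.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <functional>
#include <mutex>
#include <numeric>
#include <set>
#include <thread>
#include <vector>

// Hypothetical recorder: remembers which threads called note().
class ThreadRecorder {
public:
    void note() {
        std::lock_guard<std::mutex> lock(mutex_);
        ids_.insert(std::this_thread::get_id());
    }
    std::size_t count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ids_.size();
    }
private:
    mutable std::mutex mutex_;
    std::set<std::thread::id> ids_;
};

// Number of distinct threads that touched elements in std::for_each.
template <class Policy>
std::size_t threads_used_by_for_each(Policy&& policy, std::vector<int>& data)
{
    ThreadRecorder rec;
    std::for_each(policy, data.begin(), data.end(),
                  [&rec](int& x) { rec.note(); x += 1; });
    return rec.count();
}

// Same idea for std::transform_inclusive_scan (overload without an
// initial value); the unary transform does the recording.
template <class Policy>
std::size_t threads_used_by_scan(Policy&& policy,
                                 const std::vector<int>& in,
                                 std::vector<int>& out)
{
    ThreadRecorder rec;
    out.resize(in.size());
    std::transform_inclusive_scan(policy, in.begin(), in.end(), out.begin(),
                                  std::plus<>{},
                                  [&rec](int x) { rec.note(); return 2 * x; });
    return rec.count();
}
```

Calling both helpers with std::execution::par on a large vector, building with nvc++ -stdpar=multicore and varying OMP_NUM_THREADS, is the comparison described above: a count above one for std::for_each and a count of one for the scan would match what I observe.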

I wonder if this is a known issue or if I’m missing anything.
I can share the (silly) test program - just let me know if you want me to.



  • I am using the OpenMP host backend (which I believe is the default).
  • Besides my CPU ID mini app, I have also run scaling experiments on an HPC system and noticed no parallel speedup with std::transform_inclusive_scan (no initial value overload).

I got the scan algorithms to run in parallel by compiling without -stdpar=multicore, enabling the TBB backend of Thrust (NVIDIA’s parallel STL backbone - see here and here), and linking against TBB.