I’m using HPC SDK 23.7 and compiling without -stdpar=multicore, but with -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB.
I’ve noticed that passing thrust::counting_iterators to parallel STL algorithms such as for_each, transform, inclusive_scan, and transform_inclusive_scan results in the algorithms running sequentially.
The same algorithms run in parallel when I use my own counting-iterator implementation or oneAPI's (see here). They also run in parallel with -stdpar=multicore (except for the scan algorithms).
NOTE: Thrust’s counting iterators are random access iterators, which is in agreement with this.
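To make the pattern concrete, here is a minimal sketch of the kind of call I mean (reduced from my actual code; the vector names, size, and lambda are just placeholders, not the real program):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>
#include <thrust/iterator/counting_iterator.h>

int main() {
  std::vector<double> x(1 << 26, 1.0), y(1 << 26);

  // With -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB this runs serially for me,
  // even though std::execution::par is requested and thrust::counting_iterator
  // is a random access iterator.
  std::transform(std::execution::par,
                 thrust::counting_iterator<std::size_t>(0),
                 thrust::counting_iterator<std::size_t>(x.size()),
                 y.begin(),
                 [&x](std::size_t i) { return 2.0 * x[i]; });
}
```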
I haven’t used Thrust counting iterators before, but I ported one of my codes, LULESH, to use them, and it runs in parallel when targeting multicore. I’m not sure why it doesn’t for you.
Can you post a minimal reproducing example so I can investigate?
I’ve updated the title and description of the topic.
In brief, here’s the problem:
If I use -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB (or choose TBB as Thrust’s host system in CMake) without -stdpar=multicore, passing thrust::counting_iterator to a parallel STL algorithm results in the algorithm running serially.
I am attaching a small program that demonstrates this. It computes a math operation (vector norm) on several vectors, both with and without counting iterators. On my PCs, the version using counting iterators is slower when I compile it like this:
nvc++ -O3 -fast -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB -I/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/include/ test.cpp -o test -ltbb
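For reference, the attached program boils down to something like the following sketch (the actual attachment uses several vectors and adds timing; the names and sizes here are illustrative):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>
#include <thrust/iterator/counting_iterator.h>

int main() {
  const std::size_t n = 1 << 26;
  std::vector<double> v(n, 1.0);

  // Version without counting iterators: iterate over the data directly.
  double norm_direct = std::sqrt(std::transform_reduce(
      std::execution::par, v.begin(), v.end(), 0.0, std::plus<>{},
      [](double x) { return x * x; }));

  // Version with counting iterators: iterate over indices.
  // This is the variant that runs serially for me with the TBB host system.
  double norm_indexed = std::sqrt(std::transform_reduce(
      std::execution::par,
      thrust::counting_iterator<std::size_t>(0),
      thrust::counting_iterator<std::size_t>(n),
      0.0, std::plus<>{},
      [&v](std::size_t i) { return v[i] * v[i]; }));

  std::printf("direct: %f  indexed: %f\n", norm_direct, norm_indexed);
}
```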
However, it is worth noting that directly calling thrust::transform in the example program from my previous post works as expected with counting iterators and TBB (i.e. the algorithm runs in parallel).
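That is, a direct Thrust call along these lines does use all cores with the TBB host system (a sketch, not the exact code from the attachment; names are placeholders):

```cpp
#include <cstddef>
#include <vector>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

int main() {
  std::vector<double> x(1 << 26, 1.0), y(1 << 26);

  // Dispatches to Thrust's host system (TBB here) and runs in parallel,
  // unlike the same operation expressed via std::transform + std::execution::par.
  thrust::transform(thrust::counting_iterator<std::size_t>(0),
                    thrust::counting_iterator<std::size_t>(x.size()),
                    y.begin(),
                    [&x](std::size_t i) { return 2.0 * x[i]; });
}
```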