I’m using HPC SDK 23.7 and compiling without -stdpar=multicore, but with -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB.
I’ve noticed that passing thrust::counting_iterators to parallel STL algorithms such as for_each, transform, inclusive_scan, and transform_inclusive_scan results in the algorithms running sequentially.
I’ve updated the title and description of the topic.
In brief, here’s the problem:
If I use -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB (or choose TBB as Thrust’s host system in CMake) without -stdpar=multicore, passing thrust::counting_iterator to the parallel STL algorithms makes them run serially.
I am attaching a small program that demonstrates this. It computes a vector norm over several vectors, once with counting iterators and once without. On my PCs, the version using counting iterators is slower when I compile it like this:
nvc++ -O3 -fast -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB -I/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/include/ test.cpp -o test -ltbb
However, it is worth noting that calling thrust::transform directly in the example program from my previous post works as expected with counting iterators and TBB (i.e. the algorithm runs in parallel).
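For anyone who wants to check this without relying on timings, here is a minimal sketch of a thread-counting check. ThreadRecorder and threads_used_by_sequential_for_each are hypothetical helpers I made up for illustration; they are not part of Thrust or the standard library. The idea is to record std::this_thread::get_id() inside the functor and count how many distinct threads show up. The baseline below uses a plain sequential std::for_each, so it should report exactly one thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper (not part of Thrust or the standard library):
// records which threads execute the body of an algorithm.
class ThreadRecorder {
public:
    // Call this from inside the functor passed to the algorithm.
    void record() {
        std::lock_guard<std::mutex> lock(mutex_);
        ids_.insert(std::this_thread::get_id());
    }

    // Number of distinct threads that have called record() so far.
    std::size_t distinct_threads() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ids_.size();
    }

private:
    mutable std::mutex mutex_;       // protects ids_ when called concurrently
    std::set<std::thread::id> ids_;  // one entry per thread seen
};

// Baseline: a sequential std::for_each without an execution policy.
// Every call to the lambda happens on the calling thread.
inline std::size_t threads_used_by_sequential_for_each(std::size_t n) {
    std::vector<double> v(n, 1.0);
    ThreadRecorder recorder;
    std::for_each(v.begin(), v.end(), [&](double& x) {
        recorder.record();
        x *= 2.0;
    });
    return recorder.distinct_threads();
}
```

To test the case from this post, the std::for_each call would be replaced by the parallel call under investigation (for example, one taking std::execution::par and a pair of thrust::counting_iterator objects). A count greater than 1 confirms that several threads ran the functor. A count of 1 on a multi-core machine with a large n suggests serial execution, although it is not strict proof, and the mutex adds overhead, so this check should not be mixed with timing measurements.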