Parallel STL algorithms may run sequentially with thrust::counting_iterator

Hi,

I’m using HPC SDK 23.7 and compiling without -stdpar=multicore, but with -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB.

I’ve noticed that passing thrust::counting_iterators to parallel STL algorithms such as for_each, transform, inclusive_scan, and transform_inclusive_scan results in the algorithms running sequentially.

Example:

std::for_each(
  std::execution::par,
  thrust::make_counting_iterator(0),
  thrust::make_counting_iterator(1000000),
  [](int i) { /* do something */ }
);

The same algorithms run in parallel when using my own or oneAPI’s counting iterator implementation (see here). They also run in parallel when using -stdpar=multicore (except for scan algorithms).

NOTE: Thrust’s counting iterators are random access iterators, which is in agreement with this.
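
For readers without the linked implementations to hand, a hand-written counting iterator of the kind mentioned above can be quite small. What follows is only a minimal sketch, not the code used in this thread and not Thrust's or oneAPI's implementation. The names CountingIterator, is_random_access_v and sum_indices, and the fixed int value type, are assumptions made for illustration. It shows the member types and operators that let std::iterator_traits report the random access category.

```cpp
#include <cstddef>
#include <iterator>
#include <numeric>
#include <type_traits>

// Hypothetical stand-in for a hand-written counting iterator.
// Dereferencing yields the current index by value; nothing is stored in memory.
class CountingIterator {
public:
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = int;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const int*;
    using reference         = int;  // returned by value, not by reference

    CountingIterator() = default;
    explicit CountingIterator(int v) : value_(v) {}

    reference operator*() const { return value_; }
    reference operator[](difference_type n) const { return static_cast<int>(value_ + n); }

    CountingIterator& operator++() { ++value_; return *this; }
    CountingIterator  operator++(int) { auto old = *this; ++value_; return old; }
    CountingIterator& operator--() { --value_; return *this; }
    CountingIterator  operator--(int) { auto old = *this; --value_; return old; }
    CountingIterator& operator+=(difference_type n) { value_ += static_cast<int>(n); return *this; }
    CountingIterator& operator-=(difference_type n) { value_ -= static_cast<int>(n); return *this; }

    friend CountingIterator operator+(CountingIterator it, difference_type n) { return it += n; }
    friend CountingIterator operator+(difference_type n, CountingIterator it) { return it += n; }
    friend CountingIterator operator-(CountingIterator it, difference_type n) { return it -= n; }
    friend difference_type operator-(CountingIterator a, CountingIterator b) {
        return static_cast<difference_type>(a.value_) - b.value_;
    }

    friend bool operator==(CountingIterator a, CountingIterator b) { return a.value_ == b.value_; }
    friend bool operator!=(CountingIterator a, CountingIterator b) { return a.value_ != b.value_; }
    friend bool operator<(CountingIterator a, CountingIterator b)  { return a.value_ <  b.value_; }
    friend bool operator>(CountingIterator a, CountingIterator b)  { return a.value_ >  b.value_; }
    friend bool operator<=(CountingIterator a, CountingIterator b) { return a.value_ <= b.value_; }
    friend bool operator>=(CountingIterator a, CountingIterator b) { return a.value_ >= b.value_; }

private:
    int value_ = 0;
};

// True when an iterator type advertises random access via std::iterator_traits.
template <class It>
constexpr bool is_random_access_v =
    std::is_base_of_v<std::random_access_iterator_tag,
                      typename std::iterator_traits<It>::iterator_category>;

static_assert(is_random_access_v<CountingIterator>,
              "CountingIterator should report the random access category");

// Sums the indices in [0, n) by walking the counting range; equals n*(n-1)/2.
inline long long sum_indices(int n) {
    return std::accumulate(CountingIterator(0), CountingIterator(n), 0LL);
}
```

The same is_random_access_v check could be applied to thrust::counting_iterator<int> to confirm the category that a standard library implementation sees. Whether a given parallel backend relies on that category alone to decide between parallel and serial execution is not something this sketch establishes.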

Any idea why this is so?

Regards,
Christos

Hi Christos,

I haven’t used Thrust counting iterators before, but I ported one of my codes, LULESH, to use them and it runs in parallel when targeting multicore. I’m not sure why it doesn’t for you.

Can you post a minimal reproducing example so I can investigate?

Thanks,
Mat

Hi Mat,

Thanks for your reply.

Apologies, my description is not accurate. The problem happens without -stdpar=multicore and with -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB.

Can we delete this topic? I’ll study the matter more deeply and come back with a new topic.

Thanks,
Christos

No problem.

“Can we delete this topic?”

There’s a trash can icon you can use to delete topics, but I believe this just flags it and an admin needs to do the actual deletion.

That said, there’s no issue with keeping it.

Hi Mat,

Thanks.

I’ve updated the title and description of the topic.

In brief, here’s the problem:

If I use -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB (or choose TBB as Thrust’s host system in CMake) without -stdpar=multicore, passing a thrust::counting_iterator to the parallel STL algorithms results in them running serially.

I am attaching a small program that demonstrates this. It computes a vector norm over several vectors, with and without counting iterators. On my PCs, the version using counting iterators is slower when I compile it like this:

nvc++ -O3 -fast -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB -I/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/include/ test.cpp -o test -ltbb

test.txt (2.3 KB)
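
A timing comparison shows serial execution only indirectly. A more direct check, sketched here under assumptions, is to record which threads actually invoke the functor. ThreadRecorder and its member names are hypothetical and are not part of the attached program, Thrust, or TBB.

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>

// Hypothetical diagnostic helper: remembers the ids of all threads
// that called record(). Intended for debugging, not for timed runs,
// because the mutex serializes the calls.
class ThreadRecorder {
public:
    // Call this from inside the functor or lambda passed to the algorithm.
    void record() {
        std::lock_guard<std::mutex> lock(mutex_);
        ids_.insert(std::this_thread::get_id());
    }

    // Number of distinct threads that have called record() so far.
    std::size_t distinct_threads() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ids_.size();
    }

private:
    mutable std::mutex mutex_;
    std::set<std::thread::id> ids_;
};

// Intended use (not compiled here, since it needs <execution> and a
// parallel backend such as TBB):
//
//   ThreadRecorder rec;
//   std::for_each(std::execution::par, first, last,
//                 [&rec](int) { rec.record(); });
//   // rec.distinct_threads() > 1 shows that several threads took part.
```

A count greater than one shows that more than one thread ran the functor. A count of one is only suggestive of serial execution, since a backend may also choose a single thread for small inputs.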

Thanks,
Christos

Hi Christos,

Since this sounds like an issue with Thrust or possibly TBB, you should consider reporting this on the Thrust GitHub:
Issues · NVIDIA/thrust · GitHub

-Mat

Hi Mat,

That’s a good idea, thanks.

However, it might be worth noting that calling thrust::transform directly in the example program from my previous post works as expected with counting iterators and TBB (i.e. the algorithm runs in parallel).

Cheers,
Christos