I’m using HPC SDK 23.7 and compiling without -stdpar=multicore, but with -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB.
I’ve noticed that passing thrust::counting_iterators to parallel STL algorithms such as for_each, transform, inclusive_scan, and transform_inclusive_scan results in the algorithms running sequentially.
The same algorithms run in parallel when I use my own counting-iterator implementation or oneAPI's (see here). They also run in parallel with -stdpar=multicore (except for the scan algorithms).
NOTE: Thrust’s counting iterators are random access iterators, which is in agreement with this.
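To make the pattern concrete, here is a minimal sketch of the kind of call I mean (reduced from my actual code; the vector names, size, and lambda are just placeholders, not the real program):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>
#include <thrust/iterator/counting_iterator.h>

int main() {
  std::vector<double> x(1 << 26, 1.0), y(1 << 26);

  // With -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB this runs serially for me,
  // even though std::execution::par is requested and thrust::counting_iterator
  // is a random access iterator.
  std::transform(std::execution::par,
                 thrust::counting_iterator<std::size_t>(0),
                 thrust::counting_iterator<std::size_t>(x.size()),
                 y.begin(),
                 [&x](std::size_t i) { return 2.0 * x[i]; });
}
```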
I haven’t used Thrust counting iterators before, but I ported one of my codes, LULESH, to use them, and it runs in parallel when targeting multicore. I’m not sure why it doesn’t for you.
Can you post a minimal reproducing example so I can investigate?
I’ve updated the title and description of the topic.
In brief, here’s the problem:
If I use -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB (or choose TBB as Thrust’s host system in CMake) without -stdpar=multicore, passing thrust::counting_iterator to a parallel STL algorithm results in the algorithm running serially.
I am attaching a small program that demonstrates this. It computes a math operation (vector norm) on several vectors, both with and without counting iterators. On my PCs, the version using counting iterators is slower when I compile it like this:
nvc++ -O3 -fast -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_TBB -I/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/include/ test.cpp -o test -ltbb
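For reference, the attached program boils down to something like the following sketch (the actual attachment uses several vectors and adds timing; the names and sizes here are illustrative):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>
#include <thrust/iterator/counting_iterator.h>

int main() {
  const std::size_t n = 1 << 26;
  std::vector<double> v(n, 1.0);

  // Version without counting iterators: iterate over the data directly.
  double norm_direct = std::sqrt(std::transform_reduce(
      std::execution::par, v.begin(), v.end(), 0.0, std::plus<>{},
      [](double x) { return x * x; }));

  // Version with counting iterators: iterate over indices.
  // This is the variant that runs serially for me with the TBB host system.
  double norm_indexed = std::sqrt(std::transform_reduce(
      std::execution::par,
      thrust::counting_iterator<std::size_t>(0),
      thrust::counting_iterator<std::size_t>(n),
      0.0, std::plus<>{},
      [&v](std::size_t i) { return v[i] * v[i]; }));

  std::printf("direct: %f  indexed: %f\n", norm_direct, norm_indexed);
}
```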
However, it is worth noting that directly calling thrust::transform in the example program from my previous post works as expected with counting iterators and TBB (i.e. the algorithm runs in parallel).
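That is, a direct Thrust call along these lines does use all cores with the TBB host system (a sketch, not the exact code from the attachment; names are placeholders):

```cpp
#include <cstddef>
#include <vector>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

int main() {
  std::vector<double> x(1 << 26, 1.0), y(1 << 26);

  // Dispatches to Thrust's host system (TBB here) and runs in parallel,
  // unlike the same operation expressed via std::transform + std::execution::par.
  thrust::transform(thrust::counting_iterator<std::size_t>(0),
                    thrust::counting_iterator<std::size_t>(x.size()),
                    y.begin(),
                    [&x](std::size_t i) { return 2.0 * x[i]; });
}
```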