Thrust and concurrent execution on multi-GPU

RemixmanInf · February 21, 2018, 9:17am

I’m working with two K40 cards and using the Thrust library. My program is shown below:

for (int i = 0; i < device_count; ++i) {
	cudaSetDevice(i);
	thrust::remove_if(a_begin[i], a_end[i], is_not_zero());
}

My questions are:

In this program, does the second card begin executing only after the first has finished?
If so, how can I make the first and second cards execute concurrently?

Thank you

Robert_Crovella · February 21, 2018, 2:13pm

Some thrust algorithms can be entirely asynchronous, whereas some others involve some synchronous activity (such as device memory allocations). Thrust doesn’t document on a case-by-case basis whether an algorithm is fully asynchronous or not, and indeed it may change from one thrust version to the next.

So the easiest way to answer your first question might be to just try it.

For the second question, if there is something preventing full asynchronous behavior, there are often methods to work around it, which may vary depending on the exact issue. For example, thrust::reduce involves a final cudaMemcpy operation to copy the reduction value from device to host. This is synchronizing. To avoid this you could switch to reduce_by_key. Many thrust algorithms effectively do a cudaMalloc operation, which may appear to be synchronizing. A way to avoid this is to provide your own custom allocator, such as described here:

c++ - cuda9 + thrust sort_by_key overlayed with H2D copy (using streams) - Stack Overflow

CUB often gives more granular control over these things than thrust does, so if you are interested in fully asynchronous behavior, in some cases CUB may be an easier starting point. This question may be of interest, as it combines these ideas with a CUB focus:

c++ - CUB select if with returned indexes - Stack Overflow
