Thrust and concurrent execution on multi-GPU

I’m working with two K40 cards and using the Thrust library. My program is shown below:

for (int i = 0; i < device_count; ++i) {
	cudaSetDevice(i);
	thrust::remove_if(a_begin[i], a_end[i], is_not_zero());
}

My questions are:

  1. In this program, does the second card start executing only after the first has finished?
  2. If so, how can I make the first and second cards execute concurrently?

Thank you

Some Thrust algorithms can be entirely asynchronous, whereas others involve some synchronous activity (such as device memory allocations). Thrust doesn’t document on a case-by-case basis whether an algorithm is fully asynchronous or not, and the behavior may change from one Thrust version to the next.

So the easiest way to answer your first question might be to just try it.
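One way to "just try it" is to time the loop that issues the calls separately from the point where all devices have finished. This is only a rough sketch: the buffer size, the warm-up pass, and the raw-pointer setup are my assumptions, not part of the original program. Keep in mind that remove_if hands the new end of the range back to the host, so it may well have to wait for the device before returning.

```cuda
#include <thrust/execution_policy.h>
#include <thrust/remove.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

struct is_not_zero {
  __host__ __device__ bool operator()(int x) const { return x != 0; }
};

int main() {
  using steady = std::chrono::steady_clock;

  int device_count = 0;
  cudaGetDeviceCount(&device_count);

  const size_t n = 1 << 26;  // hypothetical problem size per device
  std::vector<int*> d(device_count, nullptr);

  // Setup plus one warm-up call per device, so context creation and
  // kernel loading do not distort the measurement.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&d[i], n * sizeof(int));
    cudaMemset(d[i], 0, n * sizeof(int));
    thrust::remove_if(thrust::device, d[i], d[i] + n, is_not_zero());
    cudaDeviceSynchronize();
  }

  // Time how long the host spends issuing the calls...
  auto t0 = steady::now();
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    thrust::remove_if(thrust::device, d[i], d[i] + n, is_not_zero());
  }
  auto t1 = steady::now();

  // ...and how long until every device is idle again.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();
  }
  auto t2 = steady::now();

  double issue_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
  double total_ms = std::chrono::duration<double, std::milli>(t2 - t0).count();
  std::printf("issue: %.3f ms, total: %.3f ms\n", issue_ms, total_ms);

  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    cudaFree(d[i]);
  }
  return 0;
}
```

If the issue time is about the same as the total time, each call blocked the host until its device was done, so the cards ran one after another. If the issue time is much smaller than the total, the work overlapped. A profiler timeline would show the same thing in more detail.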

For the second question, if something is preventing fully asynchronous behavior, there are often ways to work around it, and these vary depending on the exact issue. For example, thrust::reduce involves a final cudaMemcpy operation to copy the reduction value from device to host. This is synchronizing. To avoid this you could switch to reduce_by_key. Many Thrust algorithms also effectively do a cudaMalloc operation for temporary storage, which may appear to be synchronizing. A way to avoid this is to provide your own custom allocator, as described here:

https://stackoverflow.com/questions/48670284/cuda9-thrust-sort-by-key-overlayed-with-h2d-copy-using-streams/48671517#48671517
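As a rough sketch of the custom allocator idea, combined with one stream per device: the scratch_allocator type, its init/release helpers, the scratch size, and the choice of thrust::sort as the algorithm are all my own illustrative assumptions. The allocator simply hands out slices of a buffer that was allocated up front, so Thrust does not need to call cudaMalloc in the middle of the work. Whether the call then actually returns before the device has finished still depends on the algorithm and the Thrust version.

```cuda
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical bump allocator: serves Thrust's temporary storage requests
// from one preallocated device buffer instead of calling cudaMalloc.
struct scratch_allocator {
  typedef char value_type;

  char*  base = nullptr;
  size_t capacity = 0;
  size_t offset = 0;
  int    live = 0;  // number of outstanding allocations

  void init(size_t bytes) {
    cudaMalloc(reinterpret_cast<void**>(&base), bytes);
    capacity = bytes;
  }
  void release() {
    cudaFree(base);
    base = nullptr;
  }

  char* allocate(std::ptrdiff_t num_bytes) {
    // Round up to 256 bytes to keep each slice well aligned.
    size_t aligned = (static_cast<size_t>(num_bytes) + 255) / 256 * 256;
    if (offset + aligned > capacity) throw std::bad_alloc();
    char* p = base + offset;
    offset += aligned;
    ++live;
    return p;
  }
  void deallocate(char*, size_t) {
    // Reuse the buffer from the start once nothing is outstanding.
    if (--live == 0) offset = 0;
  }
};

int main() {
  int device_count = 0;
  cudaGetDeviceCount(&device_count);

  const size_t n = 1 << 24;  // hypothetical problem size per device
  std::vector<int*> data(device_count, nullptr);
  std::vector<cudaStream_t> streams(device_count);
  std::vector<scratch_allocator> scratch(device_count);

  // Do all allocations up front, before any work is issued.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
    cudaMalloc(&data[i], n * sizeof(int));
    cudaMemset(data[i], 0, n * sizeof(int));
    scratch[i].init(4 * n * sizeof(int));  // generous guess, not a computed bound
  }

  // Issue the work: each device gets its own stream and its own allocator.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    thrust::sort(thrust::cuda::par(scratch[i]).on(streams[i]),
                 data[i], data[i] + n);
  }

  // Wait for all devices, then clean up.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    scratch[i].release();
    cudaFree(data[i]);
    cudaStreamDestroy(streams[i]);
  }
  return 0;
}
```

The allocator is passed by reference into the execution policy, so it has to outlive the call. That is why the allocators live in a vector that is sized before the loop.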

CUB often gives more granular control over these things than thrust does, so if you are interested in fully asynchronous behavior, in some cases CUB may be an easier starting point. This question may be of interest, as it combines these ideas with a CUB focus:

https://stackoverflow.com/questions/22476069/cub-select-if-with-returned-indexes
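To give an idea of what the CUB route could look like for the program in the question, here is a sketch using cub::DeviceSelect::If. The per_device struct, the buffer sizes, and the is_zero predicate are my assumptions. Note that select keeps the elements for which the predicate is true, so remove_if with is_not_zero corresponds to selecting with is_zero. Temporary storage is sized and allocated during setup, and the number of selected items stays in device memory until you ask for it, so the loop that issues the work does not need to allocate anything or copy anything back to the host.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

struct is_zero {
  __host__ __device__ bool operator()(const int& x) const { return x == 0; }
};

// Hypothetical bundle of everything one device needs.
struct per_device {
  cudaStream_t stream;
  int*   d_in;
  int*   d_out;
  int*   d_num_selected;
  void*  d_temp;
  size_t temp_bytes;
};

int main() {
  int device_count = 0;
  cudaGetDeviceCount(&device_count);

  const int n = 1 << 24;  // hypothetical problem size per device
  std::vector<per_device> dev(device_count);

  // Setup: all allocations happen here, before any work is issued.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    per_device& p = dev[i];
    cudaStreamCreate(&p.stream);
    cudaMalloc(&p.d_in, n * sizeof(int));
    cudaMalloc(&p.d_out, n * sizeof(int));
    cudaMalloc(&p.d_num_selected, sizeof(int));
    cudaMemset(p.d_in, 0, n * sizeof(int));

    // First call with a null pointer only reports the temp storage size.
    p.d_temp = nullptr;
    p.temp_bytes = 0;
    cub::DeviceSelect::If(p.d_temp, p.temp_bytes, p.d_in, p.d_out,
                          p.d_num_selected, n, is_zero(), p.stream);
    cudaMalloc(&p.d_temp, p.temp_bytes);
  }

  // Issue the selection on every device, each in its own stream.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    per_device& p = dev[i];
    cub::DeviceSelect::If(p.d_temp, p.temp_bytes, p.d_in, p.d_out,
                          p.d_num_selected, n, is_zero(), p.stream);
  }

  // Only now fetch the counts, which waits for each device in turn.
  for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    per_device& p = dev[i];
    int h_num_selected = 0;
    cudaMemcpyAsync(&h_num_selected, p.d_num_selected, sizeof(int),
                    cudaMemcpyDeviceToHost, p.stream);
    cudaStreamSynchronize(p.stream);
    std::printf("device %d kept %d items\n", i, h_num_selected);

    cudaFree(p.d_temp);
    cudaFree(p.d_num_selected);
    cudaFree(p.d_out);
    cudaFree(p.d_in);
    cudaStreamDestroy(p.stream);
  }
  return 0;
}
```

Unlike remove_if, this writes the kept elements to a separate output buffer rather than compacting in place, so it needs twice the data memory per device.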