About thrust::execution_policy when copying data from device to host

I use thrust::copy to transfer data from device to host in a multi-GPU system. Each GPU holds an equally sized partition of the data. Using OpenMP, I call the function once per device. My current system has 4 GPUs.

#pragma omp parallel for
for (size_t i = 0; i < devices.size(); ++i) 
{
    const int device = devices[i];
    thrust::copy(thrust::device, // execution policy
                 device_buffers->At(device)->begin(), // thrust::device_vector
                 device_buffers->At(device)->end(),
                 elements->begin() + (device * block_size)); // thrust::host_vector
}
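
For completeness, here is a minimal standalone version of the pattern above, with my wrapper classes stripped out (the container names, sizes and fill values are just placeholders for this sketch):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    const size_t block_size = 1 << 20;               // elements per GPU (arbitrary here)
    thrust::host_vector<float> elements(num_devices * block_size);

    // One device_vector per GPU, allocated on the respective device.
    std::vector<thrust::device_vector<float>> device_buffers(num_devices);
    for (int d = 0; d < num_devices; ++d)
    {
        cudaSetDevice(d);
        device_buffers[d].resize(block_size, 1.0f);  // allocate and fill on device d
    }

    #pragma omp parallel for
    for (int d = 0; d < num_devices; ++d)
    {
        cudaSetDevice(d);                            // each OpenMP thread drives one GPU
        thrust::copy(thrust::device,                 // explicit policy, as in my code above;
                     device_buffers[d].begin(),      // on my POWER9 system the host memory is
                     device_buffers[d].end(),        // device-accessible, so this runs fine
                     elements.begin() + d * block_size);
    }
    return 0;
}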

After reading the documentation and the following post, I understand that the default thrust::execution_policy is chosen based on the iterators that are passed.
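
To make this concrete, these are the two call variants I am comparing (a minimal sketch; d_vec and h_vec stand in for one device buffer and the matching piece of the host vector):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>

void copy_default(const thrust::device_vector<float>& d_vec,
                  thrust::host_vector<float>& h_vec)
{
    // No explicit policy: Thrust derives the execution policy from the iterator
    // types (device_vector iterators for the source, host_vector iterator for the target).
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}

void copy_explicit_device(const thrust::device_vector<float>& d_vec,
                          thrust::host_vector<float>& h_vec)
{
    // Explicit thrust::device policy: the variant that is faster in my benchmarks,
    // but that no longer shows up as [CUDA memcpy DtoH] in nvprof.
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), h_vec.begin());
}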

  • When copying data from device to host, one of the iterators passed to the function refers to device memory (the thrust::device_vector) and the other to host memory (the thrust::host_vector).

    1. Which execution policy is picked here per default? thrust::host or thrust::device?

  • After doing some benchmarks, I observe that passing thrust::device explicitly improves performance compared to not passing an execution policy at all.
    2. What could be the reason for the performance gain? The system is a POWER9 machine. How do thrust::copy and the specific execution policy work internally? How many of the 4 copy engines of each device are actually used?

  • However, nvprof no longer displays the [CUDA memcpy DtoH] category and instead shows
    void thrust::cuda_cub::core […] __parallel_for::ParallelForAgent […], which even shows an increase in Time (s). This does not make sense, because, as I said, I observed a consistent performance improvement (smaller total execution time) when using thrust::device.

    3. Is this just an nvprof + Thrust-specific behaviour that causes the profiling numbers not to correlate with the actual execution time? I observed something similar for cudaFree: it seems that cudaFree
    returns control to the host code rather quickly, which results in a small measured execution time, while nvprof shows much higher numbers, probably because the actual deallocation happens in a lazy fashion. A small sketch of how I time the copy follows below.
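
For reference, this is roughly how I measure the wall-clock time (simplified to a single GPU and a single copy; the real benchmark times the whole OpenMP loop over all 4 devices):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    const size_t n = 1 << 26;                 // arbitrary size for this sketch (256 MiB of floats)
    thrust::device_vector<float> d_vec(n, 1.0f);
    thrust::host_vector<float> h_vec(n);

    cudaDeviceSynchronize();                  // make sure no earlier work leaks into the timing

    const auto start = std::chrono::steady_clock::now();
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), h_vec.begin());
    cudaDeviceSynchronize();                  // wait until the copy has actually finished
    const auto stop = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("explicit thrust::device copy: %.3f ms\n", ms);
    return 0;
}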