About thrust::execution_policy when copying data from device to host

I use thrust::copy to transfer data from device to host in a multi-GPU system. Each GPU holds an equally sized partition of the data. Using OpenMP, I call the function once per device. My current system has 4 GPUs.

#pragma omp parallel for
for (size_t i = 0; i < devices.size(); ++i) 
{
    const int device = devices[i];
    thrust::copy(thrust::device, // execution policy
                 device_buffers->At(device)->begin(), // thrust::device_vector
                 device_buffers->At(device)->end(),
                 elements->begin() + (device * block_size)); // thrust::host_vector
}
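
For completeness, here is a minimal standalone version of the pattern above, with my wrapper classes stripped out (the container names, sizes and fill values are just placeholders for this sketch):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    const size_t block_size = 1 << 20;               // elements per GPU (arbitrary here)
    thrust::host_vector<float> elements(num_devices * block_size);

    // One device_vector per GPU, allocated on the respective device.
    std::vector<thrust::device_vector<float>> device_buffers(num_devices);
    for (int d = 0; d < num_devices; ++d)
    {
        cudaSetDevice(d);
        device_buffers[d].resize(block_size, 1.0f);  // allocate and fill on device d
    }

    #pragma omp parallel for
    for (int d = 0; d < num_devices; ++d)
    {
        cudaSetDevice(d);                            // each OpenMP thread drives one GPU
        thrust::copy(thrust::device,                 // explicit policy, as in my code above;
                     device_buffers[d].begin(),      // on my POWER9 system the host memory is
                     device_buffers[d].end(),        // device-accessible, so this runs fine
                     elements.begin() + d * block_size);
    }
    return 0;
}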

After reading the documentation and the following post, I understand that the default thrust::execution_policy is chosen based on the iterators that are passed.
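
To make this concrete, these are the two call variants I am comparing (a minimal sketch; d_vec and h_vec stand in for one device buffer and the matching piece of the host vector):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>

void copy_default(const thrust::device_vector<float>& d_vec,
                  thrust::host_vector<float>& h_vec)
{
    // No explicit policy: Thrust derives the execution policy from the iterator
    // types (device_vector iterators for the source, host_vector iterator for the target).
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}

void copy_explicit_device(const thrust::device_vector<float>& d_vec,
                          thrust::host_vector<float>& h_vec)
{
    // Explicit thrust::device policy: the variant that is faster in my benchmarks,
    // but that no longer shows up as [CUDA memcpy DtoH] in nvprof.
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), h_vec.begin());
}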

  • When copying data from device to host, one of the iterators passed to the function refers to device memory (the thrust::device_vector) and the other to host memory (the thrust::host_vector).

    1. Which execution policy is picked here per default? thrust::host or thrust::device?

  • After doing some benchmarks, I observe that passing thrust::device explicitly improves performance compared to not passing an execution policy at all.
    2. What could be the reason for the performance gain? The system is a POWER9 machine. How do thrust::copy and the specific execution policy work internally? How many of the 4 copy engines of each device are actually used?

  • However, nvprof no longer displays the [CUDA memcpy DtoH] category and instead shows
    void thrust::cuda_cub::core […] __parallel_for::ParallelForAgent […], which even shows an increase in Time (s). This does not make sense, because, as I said, I observed a consistent performance improvement (smaller total execution time) when using thrust::device.

    3. Is this just an nvprof + Thrust-specific behaviour that causes the profiling numbers not to correlate with the actual execution time? I observed something similar for cudaFree: it seems that cudaFree
    returns control to the host code rather quickly, which results in a small measured execution time, while nvprof shows much higher numbers, probably because the actual deallocation happens in a lazy fashion. A small sketch of how I time the copy follows below.
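
For reference, this is roughly how I measure the wall-clock time (simplified to a single GPU and a single copy; the real benchmark times the whole OpenMP loop over all 4 devices):

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    const size_t n = 1 << 26;                 // arbitrary size for this sketch (256 MiB of floats)
    thrust::device_vector<float> d_vec(n, 1.0f);
    thrust::host_vector<float> h_vec(n);

    cudaDeviceSynchronize();                  // make sure no earlier work leaks into the timing

    const auto start = std::chrono::steady_clock::now();
    thrust::copy(thrust::device, d_vec.begin(), d_vec.end(), h_vec.begin());
    cudaDeviceSynchronize();                  // wait until the copy has actually finished
    const auto stop = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("explicit thrust::device copy: %.3f ms\n", ms);
    return 0;
}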