Device-to-device copying with Thrust

I want to copy some data from a large vector (on the order of 10^6 values) to a smaller one (on the order of 10^2). I am considering a couple of ways to do it. I could either copy the values one at a time from host code using Thrust (my values are stored in device vectors anyway), or I could build a vector listing all the elements to move and then launch a single kernel in which each thread copies one value.
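To make the two options concrete, here is a minimal sketch of what I have in mind. The function names (`copy_one_at_a_time`, `copy_with_kernel`, `copy_selected`), the index pattern, and the launch configuration are placeholders I made up for illustration, and I am assuming the elements to move are described by a list of integer indices into the large vector.

```cuda
#include <cstddef>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>

// Hypothetical kernel: each thread copies one selected element, dst[i] = src[idx[i]].
__global__ void copy_selected(const float* src, const int* idx, float* dst, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[idx[i]];
}

// Option 1: one assignment per element, issued from host code.
// Every dst[i] = src[j] is a separate tiny copy, so the per-call overhead adds up.
thrust::device_vector<float> copy_one_at_a_time(thrust::device_vector<float>& src,
                                                const std::vector<int>& idx)
{
    thrust::device_vector<float> dst(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        dst[i] = src[idx[i]];
    return dst;
}

// Option 2: upload the index list once, then copy everything in a single kernel launch.
thrust::device_vector<float> copy_with_kernel(const thrust::device_vector<float>& src,
                                              const std::vector<int>& idx)
{
    const int n = static_cast<int>(idx.size());
    thrust::device_vector<int> d_idx(idx.begin(), idx.end());
    thrust::device_vector<float> dst(n);
    if (n == 0) return dst;

    const int threads = 128;  // assumed block size, not tuned
    const int blocks = (n + threads - 1) / threads;
    copy_selected<<<blocks, threads>>>(thrust::raw_pointer_cast(src.data()),
                                       thrust::raw_pointer_cast(d_idx.data()),
                                       thrust::raw_pointer_cast(dst.data()), n);
    cudaDeviceSynchronize();
    return dst;
}

int main()
{
    thrust::device_vector<float> large(1000000);
    thrust::sequence(large.begin(), large.end());  // large[k] = k

    // Arbitrary non-adjacent indices, all below 10^6.
    std::vector<int> idx(100);
    for (int i = 0; i < 100; ++i) idx[i] = i * 9973;

    thrust::device_vector<float> a = copy_one_at_a_time(large, idx);
    thrust::device_vector<float> b = copy_with_kernel(large, idx);

    const bool same = thrust::equal(a.begin(), a.end(), b.begin());
    std::printf("results %s\n", same ? "match" : "differ");
    return same ? 0 : 1;
}
```

Both functions should produce the same small vector. The difference is only in how many separate operations are issued from the host.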

I am wondering about the performance of each approach. Will I incur appreciable overhead by using Thrust to copy the values? Either way, the operation will be memory-bandwidth limited, but what is the fastest way to copy a set of values (not all of which are adjacent) from a large vector to a smaller one?


Has anyone benchmarked copying vectors one element at a time with Thrust versus a single kernel call?

EDIT: I wrote a small benchmark to test it myself. For one million elements, copying one element at a time takes 3.5 s on my machine, versus 0.7 ms using thrust::copy. I guess I will need to create a vector and launch a single kernel for the copy operation.
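A possible way to get the single-kernel behaviour without writing the kernel by hand is `thrust::gather`, which copies `src[map[i]]` into `dst[i]` for every entry of an index map. This is only a sketch, assuming the elements to move can be expressed as a device vector of indices. The helper name `gather_selected` and the index pattern are made up for illustration, and I have not timed it against the numbers above.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>

// Hypothetical helper: copy src[d_idx[i]] into dst[i] with one Thrust call.
thrust::device_vector<float> gather_selected(const thrust::device_vector<float>& src,
                                             const thrust::device_vector<int>& d_idx)
{
    thrust::device_vector<float> dst(d_idx.size());
    thrust::gather(d_idx.begin(), d_idx.end(), src.begin(), dst.begin());
    return dst;
}

int main()
{
    thrust::device_vector<float> large(1000000);
    thrust::sequence(large.begin(), large.end());  // large[k] = k

    // Arbitrary non-adjacent indices, all below 10^6, built on the host and uploaded once.
    std::vector<int> h_idx(100);
    for (int i = 0; i < 100; ++i) h_idx[i] = i * 9973;
    thrust::device_vector<int> d_idx(h_idx.begin(), h_idx.end());

    thrust::device_vector<float> d_out = gather_selected(large, d_idx);

    // Copy the small result back in one transfer and check it on the host.
    thrust::host_vector<float> out = d_out;
    bool ok = true;
    for (int i = 0; i < 100; ++i)
        ok = ok && (out[i] == static_cast<float>(h_idx[i]));

    std::printf("gather %s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

The index vector is uploaded once and the copy is issued as one call, which is the same idea as the single-kernel approach.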