I want to copy some data from a large vector (on the order of 10^6 values) to a smaller one (on the order of 10^2 values). I am considering two ways to do it. I could copy the values one at a time from the host using Thrust (my values are already stored in device vectors), or I could build a vector identifying all the elements to move (for example, their indices) and then launch a single kernel in which each thread copies one value.
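To make the second option concrete, here is a minimal sketch of what I have in mind, assuming the elements to move are described by a device vector of integer indices. The names `copy_selected_thrust`, `copy_selected_kernel`, and `gather_kernel`, the `float` element type, and the stride used to pick indices are all placeholders of mine. It shows a hand-written kernel next to `thrust::gather`, which as far as I can tell performs the same indexed copy.

```cuda
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical hand-written kernel: thread i copies src[idx[i]] into dst[i].
__global__ void gather_kernel(const float* src, const int* idx, float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[idx[i]];
}

// Indexed copy using thrust::gather (map range, source begin, output begin).
thrust::host_vector<float> copy_selected_thrust(const thrust::device_vector<float>& large,
                                                const thrust::device_vector<int>& idx)
{
    thrust::device_vector<float> small(idx.size());
    thrust::gather(idx.begin(), idx.end(), large.begin(), small.begin());
    return thrust::host_vector<float>(small);  // copy back only to inspect the result
}

// The same indexed copy using the hand-written kernel above.
thrust::host_vector<float> copy_selected_kernel(const thrust::device_vector<float>& large,
                                                const thrust::device_vector<int>& idx)
{
    const int n = static_cast<int>(idx.size());
    thrust::device_vector<float> small(n);
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    gather_kernel<<<blocks, threads>>>(thrust::raw_pointer_cast(large.data()),
                                       thrust::raw_pointer_cast(idx.data()),
                                       thrust::raw_pointer_cast(small.data()), n);
    cudaDeviceSynchronize();
    return thrust::host_vector<float>(small);
}

int main()
{
    const int n_large = 1000000;  // order 10^6
    const int n_small = 100;      // order 10^2

    thrust::device_vector<float> large(n_large);
    thrust::sequence(large.begin(), large.end());  // large[i] = i

    // Placeholder selection of non-adjacent elements: every 9973rd value.
    thrust::host_vector<int> h_idx(n_small);
    for (int i = 0; i < n_small; ++i) h_idx[i] = i * 9973;
    thrust::device_vector<int> idx = h_idx;

    thrust::host_vector<float> a = copy_selected_thrust(large, idx);
    thrust::host_vector<float> b = copy_selected_kernel(large, idx);

    // Both versions should pick out the same values.
    for (int i = 0; i < n_small; ++i) {
        assert(a[i] == static_cast<float>(h_idx[i]));
        assert(a[i] == b[i]);
    }
    return 0;
}
```

The copies back to a `host_vector` are only there so the result can be checked; in my real code the small vector would stay on the device.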
I am wondering about the performance of each approach. Will I incur appreciable overhead by using Thrust to copy the values? Either way, the operation will be limited by memory bandwidth, but what is the fastest way to copy a set of values (not all of them adjacent) from a large vector to a smaller one?
Thanks.