How to measure the performance of Thrust

I wrote a program that sorts strings, but I cannot accurately measure the time taken by each Thrust call. The relevant part of the code is below:

// Elapsed time between two timestamps, in milliseconds.
double d(timeval * begin, timeval * end) {
    return (end->tv_sec - begin->tv_sec) * 1000.0 + (end->tv_usec - begin->tv_usec) / 1000.0;
}

timeval t1, t2, t3, t4;
gettimeofday(&t1, 0);
// Allocate device memory and copy the input from host to device.
thrust::device_vector TheString_D(size_input);
thrust::copy(TheString_H, TheString_H+size_input, TheString_D.begin());
gettimeofday(&t2, 0);
// Sort on the device.
thrust::sort(TheString_D.begin(), TheString_D.end());
gettimeofday(&t3, 0);
// Copy the sorted data back to the host.
thrust::copy(TheString_D.begin(), TheString_D.end(), TheString_H);
gettimeofday(&t4, 0);
printf("%f %f %f\n", d(&t1, &t2), d(&t2, &t3), d(&t3, &t4));

The result (in milliseconds, one value per interval) is:
5642.147000 10.240000 0.295000

That is a very weird result: the last copy is far too fast and the sort is also quite fast, while the first part is very slow. It seems that asynchronous execution is hiding the cost of the last two Thrust calls. So how can I get the exact time of each Thrust function? I tried adding "cudaDeviceSynchronize()" or "cudaThreadSynchronize()" after each Thrust call, but that didn't help.
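
One possible way to separate the three phases is sketched below. It rests on two assumptions that are not confirmed by the post: that the large first interval mostly reflects one-time CUDA context creation (triggered by the first runtime call), and that the elements are of type `int`, since the post does not show the element type. The names `elapsed_ms` and `timed_sort` are hypothetical helpers made up for this illustration. The sketch forces context creation with `cudaFree(0)` before timing and uses CUDA events instead of `gettimeofday`.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical helper: milliseconds elapsed between two recorded events.
static float elapsed_ms(cudaEvent_t start, cudaEvent_t stop) {
    float ms = 0.0f;
    cudaEventSynchronize(stop);  // wait until all work recorded before `stop` is done
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

// Sorts `host` on the GPU and stores per-phase times in ms[0..2]:
//   ms[0] = device allocation + host-to-device copy
//   ms[1] = sort
//   ms[2] = device-to-host copy
// The element type `int` is an assumption; the post does not show it.
void timed_sort(std::vector<int>& host, float ms[3]) {
    cudaFree(0);  // warm-up: forces CUDA context creation outside the timed region

    cudaEvent_t e[4];
    for (int i = 0; i < 4; ++i) cudaEventCreate(&e[i]);

    cudaEventRecord(e[0]);
    thrust::device_vector<int> dev(host.size());
    thrust::copy(host.begin(), host.end(), dev.begin());
    cudaEventRecord(e[1]);
    thrust::sort(dev.begin(), dev.end());
    cudaEventRecord(e[2]);
    thrust::copy(dev.begin(), dev.end(), host.begin());
    cudaEventRecord(e[3]);

    // Each elapsed_ms call synchronizes on the later event before reading the time.
    for (int i = 0; i < 3; ++i) ms[i] = elapsed_ms(e[i], e[i + 1]);
    for (int i = 0; i < 4; ++i) cudaEventDestroy(e[i]);
}
```

Calling `timed_sort(v, ms)` on a `std::vector<int> v` fills `ms` with the three intervals in milliseconds. If the assumption about context creation holds, the first interval should shrink considerably once the warm-up call is in place. If it does not, the cost lies elsewhere and a profiler would be the next thing to try.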