Time spent on transferring data: is something wrong?

I have a problem. I used a timer to measure how long it takes to copy data from host to device. Although it is a large batch of data, the copy takes less than 10 ms. What confuses me is that copying the data back takes more than 600 ms, even though it is only a few hundred bytes. The kernel itself appears to take only 0.06 ms. What is happening? I think the 600 ms is really the time spent in the kernel, and the 0.06 ms is only the time needed to launch it. Am I right?

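I think I'm right. After adding cudaThreadSynchronize() right after the kernel launch, the timings confirm it: the kernel launch returns immediately, so without the synchronization the device-to-host copy was waiting for the kernel to finish and the kernel's run time was being counted as copy time.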
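For anyone who hits the same thing, here is a minimal sketch of how the two measurements differ. It is not my original code: `busy_kernel`, `time_phases`, the `Timings` struct, and the sizes are all made up for illustration, and it uses a host-side `std::chrono` timer. It calls `cudaDeviceSynchronize()`, which is the newer name for the deprecated `cudaThreadSynchronize()`.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel that only burns time; it stands in for the real kernel.
__global__ void busy_kernel(float *out, int iters) {
    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) {
        x = x * 1.0000001f + 1.0f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Milliseconds elapsed on the host since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    auto dt = std::chrono::steady_clock::now() - t0;
    return std::chrono::duration<double, std::milli>(dt).count();
}

struct Timings {
    double launch_ms;     // time until the launch call returned
    double kernel_ms;     // time from launch until the "kernel" timer was stopped
    double copy_back_ms;  // time measured around the device-to-host cudaMemcpy
};

Timings time_phases(bool sync_after_kernel) {
    const int n = 256;          // a few hundred floats, like the small result buffer
    const int iters = 1 << 22;  // assumed workload size, chosen only to be noticeable
    float h_out[n];
    float *d_out = nullptr;
    // cudaMalloc also creates the CUDA context, so that cost is not timed below.
    check(cudaMalloc(&d_out, n * sizeof(float)), "cudaMalloc");

    Timings t{};
    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<1, n>>>(d_out, iters);  // asynchronous: returns before the kernel ends
    t.launch_ms = ms_since(t0);
    check(cudaGetLastError(), "kernel launch");
    if (sync_after_kernel) {
        // Block the host until the kernel has actually finished.
        check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    }
    t.kernel_ms = ms_since(t0);

    auto t1 = std::chrono::steady_clock::now();
    // cudaMemcpy waits for the preceding kernel, so without the sync above
    // the kernel's run time ends up inside this measurement.
    check(cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost),
          "cudaMemcpy");
    t.copy_back_ms = ms_since(t1);

    check(cudaFree(d_out), "cudaFree");
    return t;
}
```

Calling `time_phases(false)` and `time_phases(true)` from `main` and printing the three fields should show the effect: without the synchronization most of the time lands in `copy_back_ms`, and with it the time moves to `kernel_ms`. The exact numbers depend on the GPU, so I am not quoting any here.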