Copy back to host takes much longer than copy to device, why?

Hi, I have written a small kernel: I copy the data to the device, run the kernel, and then copy the results from the device back to the host. My code looks like this (simplified):

// copy to device
cudaMemcpy(d_values, values, size_values, cudaMemcpyHostToDevice);
cudaMemcpy(d_fitness, fitness, size_fitness, cudaMemcpyHostToDevice);

// kernel launch
compute<<<BLOCKNUMBER, THREADSIZE>>>(d_fitness, d_values, fRand(0, 100.0));

// copy back to host
cudaMemcpy(fitness, d_fitness, size_fitness, cudaMemcpyDeviceToHost);
cudaMemcpy(values, d_values, size_values, cudaMemcpyDeviceToHost);

If I measure the time with all copy operations, I get 1.99143119 seconds.
If I measure the time without the copy back operations, I get 0.617714564 seconds.
If I measure the time without the copy to operations but with the copy back operations, I get 1.49729565 seconds.

So the copy back to host operations take much longer than all the other operations combined, but why? I copy the same amount of data back as I copied TO the device. What is the problem?
Thank you very much in advance for your time and help!

Any ideas? Could this perhaps be a synchronization issue?

When you remove the copy back operations, you are only measuring the time of the first copy operations and the kernel launch, not the kernel's execution, because a kernel launch is asynchronous: the call returns to the host immediately while the kernel runs on the device. You need to call cudaDeviceSynchronize() after launching the kernel to measure the execution time correctly. When you do include the copy back operations, they act as an implicit synchronization point, which is why the kernel's execution time gets attributed to the device-to-host copies.
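A minimal sketch of what a corrected timing could look like, reusing the names from the original post (compute, d_fitness, d_values, BLOCKNUMBER, THREADSIZE, fRand) and using CUDA events, which time work on the device itself; error checking omitted for brevity:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record an event before and after the launch; the launch itself
// returns immediately, so without synchronization a host timer
// would only capture the launch overhead.
cudaEventRecord(start);
compute<<<BLOCKNUMBER, THREADSIZE>>>(d_fitness, d_values, fRand(0, 100.0));
cudaEventRecord(stop);

// Block the host until the stop event (and thus the kernel) completes.
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // kernel time in milliseconds
printf("kernel: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Alternatively, a plain cudaDeviceSynchronize() right after the launch, before stopping a host-side timer, achieves the same effect for wall-clock measurements.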

The other option is to use the NVIDIA Visual Profiler or nvprof; they will give you the execution time of each cudaMemcpy and each kernel individually.
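For reference, a minimal nvprof invocation (assuming the executable is named ./app) could look like:

```shell
# Prints a summary table with the time spent in each kernel and in
# each cudaMemcpy direction (HtoD and DtoH listed separately).
nvprof ./app
```

This makes it easy to see whether the device-to-host copies are really slower or whether the kernel time is simply being charged to them.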

Thank you very much!! You are so right, thanks!!!