Hi, I have written a small kernel. I copy the data to the device, run the kernel, and then copy the data from the device back to the host. My code looks like this (simplified):
// copy-to operations (host to device)
cudaMemcpy(d_values, values, size_values, cudaMemcpyHostToDevice);
cudaMemcpy(d_fitness, fitness, size_fitness, cudaMemcpyHostToDevice);
// kernel
compute<<<BLOCKNUMBER, THREADSIZE>>>(d_fitness, d_values, fRand(0, 100.0));
// copy-back operations (device to host)
cudaMemcpy((void*) fitness, d_fitness, size_fitness, cudaMemcpyDeviceToHost);
cudaMemcpy((void*) values, d_values, size_values, cudaMemcpyDeviceToHost);
If I measure the time with all copy operations, I get 1.99143119 seconds.
If I measure the time without the copy-back operations, I get 0.617714564 seconds.
If I measure the time without the copy-to operations but with the copy-back operations, I get 1.49729565 seconds.
So the copy-back operations take much longer than all the other operations, but why? I copy back the same data that I copied to the device. What is the problem?
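One way to narrow this down might be to time each phase on the GPU itself. Kernel launches return control to the host before the kernel has finished, so a host-side timer can end up charging kernel time to the next blocking call, such as the first cudaMemcpy back to the host. The following is a minimal sketch using CUDA events, not the actual code: the float element type, the array length n, the extra kernel parameter n, the kernel body, the launch configuration, and the constant 42.0f in place of fRand(0, 100.0) are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: element type, body, and the
// extra length parameter n are assumptions, not the original code.
__global__ void compute(float *fitness, float *values, float r, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fitness[i] = values[i] * r;
    }
}

int main() {
    const int n = 1 << 20;                    // assumed problem size
    const size_t bytes = n * sizeof(float);   // same size for both arrays (assumption)

    float *values = (float *)malloc(bytes);
    float *fitness = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) {
        values[i] = 1.0f;
        fitness[i] = 0.0f;
    }

    float *d_values = NULL;
    float *d_fitness = NULL;
    cudaMalloc((void **)&d_values, bytes);
    cudaMalloc((void **)&d_fitness, bytes);

    // Events are recorded in stream order on the device, so the intervals
    // between them cover exactly the work issued between the records.
    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_values, values, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_fitness, fitness, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);

    // The launch itself returns immediately on the host.
    compute<<<(n + 255) / 256, 256>>>(d_fitness, d_values, 42.0f, n);
    cudaEventRecord(t2);

    cudaMemcpy(fitness, d_fitness, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(values, d_values, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);

    // Wait until the last event has actually been reached on the device.
    cudaEventSynchronize(t3);

    float ms_in = 0.0f, ms_kernel = 0.0f, ms_out = 0.0f;
    cudaEventElapsedTime(&ms_in, t0, t1);
    cudaEventElapsedTime(&ms_kernel, t1, t2);
    cudaEventElapsedTime(&ms_out, t2, t3);
    printf("copy to device: %f ms\n", ms_in);
    printf("kernel:         %f ms\n", ms_kernel);
    printf("copy to host:   %f ms\n", ms_out);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaEventDestroy(t3);
    cudaFree(d_values);
    cudaFree(d_fitness);
    free(values);
    free(fitness);
    return 0;
}
```

Built with nvcc, this prints three durations in milliseconds. If the kernel interval turns out to dominate, the long copy-back seen with a host timer would probably be mostly the wait for the kernel to finish rather than the transfer itself. With a host timer, calling cudaDeviceSynchronize() right after the launch and before stopping the timer should have a similar separating effect.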
Thank you very much in advance for your time and help!