Device-to-host memcpy: how do I make this faster?

Hi all

Could anyone let me know what I'm doing wrong here? I seem to be getting really slow transfer speeds from the device to the host.

The program basically writes some data into two arrays, and I then need to get that data back to the host.

I'm currently using the following code to do this:

CUDA_SAFE_CALL(cudaMemcpy(Liquid,Liquidd,memsize,cudaMemcpyDeviceToHost));

CUDA_SAFE_CALL(cudaMemcpy(Crystal,Crystald,memsize,cudaMemcpyDeviceToHost));

Each array I am copying is 2048 x 2048 x sizeof(int) bytes, and I'm transferring two of them.

If I just run the program, I get a 4.75-second pause while it executes those two lines of code, but if I pause before them, with a scan statement for example, the delay is reduced.

Any ideas how I can do this a bit quicker? As far as I can see I'm transferring 32 MB of data (two arrays of 16 MB each, with 4-byte ints), and it's taking nearly 5 seconds.

I'm pretty sure the device-to-host bandwidth isn't 7 MB/s, so I guess I'm doing something wrong.

Thanks

Mark

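I guess the delay you are seeing is your kernel still executing. Kernel launches are asynchronous: control returns to the host immediately, and the first cudaMemcpy then has to wait for the kernel to finish before it can start copying, so the kernel's run time gets counted as part of the copy.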

– Kuisma
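To check this, the kernel time and the copy time can be measured separately. Below is a minimal sketch using CUDA events, not the original program: the kernel fillArrays, the CHECK macro and the helper mibPerSecond are hypothetical stand-ins (the original kernel was not posted, and CHECK is used here in place of CUDA_SAFE_CALL). The array size matches the 2048 x 2048 ints from the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro, standing in for CUDA_SAFE_CALL.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                    cudaGetErrorString(err_), __LINE__);          \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: does some busy work per element, then writes both arrays.
__global__ void fillArrays(unsigned int *liquid, unsigned int *crystal, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int v = (unsigned int)i;
    for (int k = 0; k < 2000; ++k) v = v * 1664525u + 1013904223u;
    liquid[i] = v;
    crystal[i] = ~v;
}

// Convert a byte count and a time in milliseconds to MiB per second.
static double mibPerSecond(size_t bytes, float ms)
{
    return (bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}

int main()
{
    const size_t n = 2048 * 2048;
    const size_t memsize = n * sizeof(unsigned int);  // 16 MiB per array

    unsigned int *liquid = (unsigned int *)malloc(memsize);
    unsigned int *crystal = (unsigned int *)malloc(memsize);
    if (!liquid || !crystal) { fprintf(stderr, "host malloc failed\n"); return 1; }

    unsigned int *liquidD = NULL, *crystalD = NULL;
    CHECK(cudaMalloc((void **)&liquidD, memsize));
    CHECK(cudaMalloc((void **)&crystalD, memsize));

    cudaEvent_t start, kernelDone, copyDone;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&kernelDone));
    CHECK(cudaEventCreate(&copyDone));

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    CHECK(cudaEventRecord(start, 0));
    fillArrays<<<blocks, threads>>>(liquidD, crystalD, n);  // returns immediately
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(kernelDone, 0));  // completes when the kernel has finished

    CHECK(cudaMemcpy(liquid, liquidD, memsize, cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(crystal, crystalD, memsize, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(copyDone, 0));
    CHECK(cudaEventSynchronize(copyDone));

    float kernelMs = 0.0f, copyMs = 0.0f;
    CHECK(cudaEventElapsedTime(&kernelMs, start, kernelDone));
    CHECK(cudaEventElapsedTime(&copyMs, kernelDone, copyDone));

    printf("kernel: %.3f ms\n", kernelMs);
    printf("copies: %.3f ms (%.1f MiB/s)\n", copyMs, mibPerSecond(2 * memsize, copyMs));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(kernelDone));
    CHECK(cudaEventDestroy(copyDone));
    CHECK(cudaFree(liquidD));
    CHECK(cudaFree(crystalD));
    free(liquid);
    free(crystal);
    return 0;
}
```

If you would rather time with a host clock, calling cudaDeviceSynchronize() after the kernel launch and before starting the timer should have the same effect: the wait for the kernel then happens before the copy is timed.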

Ah, that would explain it all.

Thanks for the quick response.