I implemented asynchronous copying using a stream (hopefully correctly!?).
I made a cudaStream_t, used cudaMallocHost() for the host array (as opposed to regular malloc()), used cudaMemcpyAsync() (as opposed to plain cudaMemcpy()), and passed the stream to both the kernel launches and cudaMemcpyAsync().
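For reference, here is a minimal sketch of that setup. The kernel and buffer names are placeholders (my real kernels do the actual simulation work); the point is just the stream/pinned-memory/async-copy plumbing:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel standing in for the real simulation step.
__global__ void step(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned (page-locked) host memory is required for the copy to be
    // truly asynchronous; with malloc()'d memory the copy falls back to
    // a synchronous path.
    float *h_data, *d_data;
    cudaMallocHost(&h_data, N * sizeof(float));
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    // Kernel and copy are queued on the same stream, so they execute in
    // issue order on the GPU, but the host thread returns immediately.
    step<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N);
    cudaMemcpyAsync(h_data, d_data, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait before reading h_data
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    cudaFreeHost(h_data);
    cudaStreamDestroy(stream);
    return 0;
}
```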
[quote name='SPWorley' date='Apr 6 2010, 01:10 PM' post='1034639']
So you’d set up a stream, and queue up your two kernels, then your ASYNCHRONOUS memory copy back to host, then your “copy done” event, then the two kernels again. Your CPU keeps polling for the mem transfer completion and prints the status as soon as it’s ready… and likely then inserting the next memcopy, event signal, and 2 kernel launches again.
[/quote]
I didn’t have any events or re-queue the kernels, as you suggest here SPWorley. Have I done this incorrectly?
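If I understand the suggestion, it would look something like the sketch below: record an event after the async copy, queue the next kernel pair straight away, and have the CPU poll the event with cudaEventQuery() instead of blocking on the whole stream. Kernel names and the two-kernels-per-step structure are just stand-ins for my real code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two placeholder kernels standing in for the real per-step work.
__global__ void kernelA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void kernelB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    cudaStream_t stream;  cudaStreamCreate(&stream);
    cudaEvent_t copyDone; cudaEventCreate(&copyDone);

    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // pinned host buffer
    cudaMalloc(&d_buf, bytes);
    cudaMemset(d_buf, 0, bytes);

    dim3 block(256), grid((N + 255) / 256);

    // One step pair, then the async snapshot and an event marking it.
    kernelA<<<grid, block, 0, stream>>>(d_buf, N);
    kernelB<<<grid, block, 0, stream>>>(d_buf, N);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(copyDone, stream);

    // Immediately queue the next step pair so the GPU never goes idle
    // waiting on the host.
    kernelA<<<grid, block, 0, stream>>>(d_buf, N);
    kernelB<<<grid, block, 0, stream>>>(d_buf, N);

    // The CPU polls the event rather than blocking on the stream.
    while (cudaEventQuery(copyDone) == cudaErrorNotReady) {
        // free to do other host work here, e.g. write the previous
        // snapshot to disk
    }
    printf("snapshot[0] = %f\n", h_buf[0]);  // h_buf valid once event fired

    cudaEventDestroy(copyDone);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

With everything on one stream the copy and the next kernels still serialize on the GPU, but the host thread overlaps with GPU work, which I believe is the point being made.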
Meanwhile, here are some results:
Printing every 100th step (F=100):
– non-async = 103s
– with-async = 93s
Performance increase: 10.75%
Printing every 10th step (F=10):
– non-async = 213s
– with-async = 123s
Performance increase: 73.17%
Printing every step (F=1):
– non-async = 567s
– with-async = 516s
Performance increase: 9.88% [Note: HDD capacity reached here so that may have something to do with this odd result]