async memcpy only seems to overlap device->host

[font=“Lucida Console”]As nearly as I can tell, cudaMemcpyAsync only overlaps with GPU operations if the copy direction is device->host.

As an example, I can run the simpleStream example as written and I get output like this:

non-streamed: 28.86
2 streams   : 18.16

For a nice speedup :-)

If I then insert a line like this:

cudaMemcpy(d_a, a, 0, cudaMemcpyHostToDevice);

in the ‘time non-streamed execution for reference’ block

and an equivalent:

for (int i = 0; i < nstreams; ++i)

		cudaMemcpyAsync(d_a, a, 0, cudaMemcpyHostToDevice, streams[i]);

in the ‘time execution with nstreams streams’ block, I get output like this:

non-streamed: 41.01
2 streams   : 42.68

This is not quite what I was expecting :-( The slowdown doesn’t bother me, but the fact that the async version no longer provides any speedup does.

Is there any way to load from CPU->GPU in an overlapped manner? I was kinda counting on being able to do this to keep my processing pipeline busy …
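For concreteness, here is a minimal sketch of the kind of pipeline in question: each stream copies its own chunk host->device, runs a kernel on it, and copies the result back. The names add_value and run_chunked are hypothetical and not part of simpleStream. The sketch assumes the host buffer is page-locked (allocated with cudaMallocHost) and that the element count divides evenly by the number of streams. It only shows how the work is issued; whether the host->device copies actually overlap with kernel execution depends on the device and driver, so treat it as a starting point for timing rather than a guaranteed fix.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds a constant to each element.
__global__ void add_value(int *data, int value, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += value;
}

// Issues copy-in, kernel, and copy-out for each chunk in that chunk's stream.
// Assumes n is a positive multiple of nstreams. Returns true if every
// element ends up with the expected value.
bool run_chunked(int n, int nstreams)
{
    const int chunk = n / nstreams;
    const size_t chunk_bytes = chunk * sizeof(int);

    int *a = NULL;    // page-locked host buffer (assumption: needed for async copies)
    int *d_a = NULL;  // device buffer
    if (cudaMallocHost((void **)&a, n * sizeof(int)) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d_a, n * sizeof(int)) != cudaSuccess) {
        cudaFreeHost(a);
        return false;
    }
    for (int i = 0; i < n; ++i)
        a[i] = i;

    std::vector<cudaStream_t> streams(nstreams);
    for (int i = 0; i < nstreams; ++i)
        cudaStreamCreate(&streams[i]);

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;
    for (int i = 0; i < nstreams; ++i) {
        const int offset = i * chunk;
        // Each stream works on a disjoint slice of the host and device buffers.
        cudaMemcpyAsync(d_a + offset, a + offset, chunk_bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        add_value<<<blocks, threads, 0, streams[i]>>>(d_a + offset, 5, chunk);
        cudaMemcpyAsync(a + offset, d_a + offset, chunk_bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    // Wait for all streams to finish before reading the results on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i)
        ok = (a[i] == i + 5);

    for (int i = 0; i < nstreams; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_a);
    cudaFreeHost(a);
    return ok;
}

int main()
{
    const bool ok = run_chunked(1 << 20, 2);
    printf("%s\n", ok ? "results correct" : "failed");
    return ok ? 0 : 1;
}
```

Note that, unlike the zero-byte copies above, each stream here transfers a real, distinct slice, so wrapping the issue loop in the same event timing that simpleStream uses should show whether any overlap is taking place.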

-Mark Roulo[/font]