As nearly as I can tell, the cudaMemcpyAsync function only overlaps with other GPU operations if the direction is gpu->host.
As an example, I can run the simpleStream example as written and I get output like this:
non-streamed: 28.86
2 streams   : 18.16
For a nice speedup :-)
If I then insert a line like this:
cudaMemcpy(d_a, a, 0, cudaMemcpyHostToDevice);
in the ‘time non-streamed execution for reference’ block
and an equivalent:
for (int i = 0; i < nstreams; ++i) cudaMemcpyAsync(d_a, a, 0, cudaMemcpyHostToDevice, streams[i]);
in the ‘time execution with nstreams streams’ block, I get output like this:
non-streamed: 41.01
2 streams   : 42.68
This is not quite what I was expecting :-( The slowdown itself doesn't bother me,
but the fact that the async version no longer provides any speedup does.
Is there any way to load from CPU->GPU in an overlapped manner? I was
kinda counting on being able to do this to keep my processing pipeline busy …
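For what it's worth, here is a minimal sketch of the pattern I'm after (this is illustrative, not the simpleStreams code itself; the `scale` kernel and sizes are made up). My understanding is that truly asynchronous host->device copies require page-locked host memory from cudaMallocHost, and that the device must report asyncEngineCount > 0 (older docs call this deviceOverlap) for a copy to overlap with a kernel at all:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, just so each stream has work to overlap with copies.
__global__ void scale(float *d, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 0 here means the device cannot overlap copies with kernel execution at all.
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    const int n = 1 << 20, nstreams = 2, chunk = n / nstreams;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float)); // pinned memory: required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&streams[i]);

    // Each stream copies its own chunk and then launches a kernel on it,
    // so the copy in stream 1 can overlap the kernel running in stream 0.
    for (int i = 0; i < nstreams; ++i) {
        float *hp = h + i * chunk, *dp = d + i * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(dp, 2.0f, chunk);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

If the copies still serialize with this setup, I'd love to know whether that's a hardware limitation or something I'm doing wrong.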