Simple test of streams versus several launches: some performance questions

I must admit that I am pretty much a novice when it comes to streams, but I got them working and found some interesting results when comparing their performance against one big launch and against several smaller launches with the same parameters as the streams.

I based my work on the simpleStreams example in the SDK, but I wanted to extend it a bit to use a different kernel with the same kind of access pattern. I also wanted to add a cudaMemcpy H2D step.

My streams are set up like this:
for(i=0; i < launches; i++)
…cudaMemcpyAsync(g_in + i*memory_size … H2D)

for(i=0; i < launches; i++)
…kernel<<<blocks, threads>>>(g_in + i*memory_size, g_out + i*memory_size)

for(i=0; i < launches; i++)
…cudaMemcpyAsync(out + i*memory_size … D2H)

I compared the runtime results with the following single launch:

cudaMemcpy(g_in, memory_size * launches, H2D)
kernel<<<blocks * launches, threads>>>(g_in, g_out)
cudaMemcpy(out, g_out, memory_size * launches, D2H)

and with a third version that is exactly the same as the first, just without the async calls.

The results show no difference between approaches 1 and 3, but both are a lot quicker than approach 2. I benchmarked using cudaEventElapsedTime, gettimeofday, and the time command in Linux. Is there something I am missing about how to use streams, or are the memory copies being overlapped by the operating system in a similar fashion, giving good performance anyway?

I didn’t quite understand what your 3 “approaches” are, but the problem seems similar to mine:

The approaches come from the different ways there are to process a set of data. Either do everything in one launch without the async API, which means more blocks per launch, or split the work into a number of smaller launches, with or without the async API.
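
To make the comparison concrete, here is a minimal sketch of approach 1 (chunked, async, one stream per chunk) and approach 2 (one big launch), both timed with CUDA events. It is not the code from the post: the kernel body, the sizes, and the host buffer name `in` are assumptions made up for illustration, and error checking is left out for brevity. Two things it does that the pseudocode above does not show: the host buffers are allocated with cudaMallocHost, since cudaMemcpyAsync generally cannot overlap with kernel execution when the host memory is pageable, and the kernel launch passes the stream as the fourth launch parameter, since a launch without it goes to the default stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread doubles one element.
__global__ void kernel(const float *in, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = 2.0f * in[idx];
}

int main()
{
    // Assumed sizes, chosen only for illustration.
    const int launches = 4, blocks = 1024, threads = 256;
    const int chunk = blocks * threads;               // elements per launch
    const size_t chunk_bytes = chunk * sizeof(float);
    const size_t total_bytes = chunk_bytes * launches;

    float *in, *out, *g_in, *g_out;
    cudaMallocHost(&in, total_bytes);                 // pinned host memory
    cudaMallocHost(&out, total_bytes);
    cudaMalloc(&g_in, total_bytes);
    cudaMalloc(&g_out, total_bytes);
    for (int i = 0; i < chunk * launches; i++) in[i] = 1.0f;

    cudaStream_t streams[launches];
    for (int i = 0; i < launches; i++) cudaStreamCreate(&streams[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Approach 1: chunked copies and launches, each chunk in its own stream.
    // The events are recorded in the legacy default stream (0), which is
    // assumed here to wait for earlier work in the streams created above.
    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; i++)
        cudaMemcpyAsync(g_in + i * chunk, in + i * chunk, chunk_bytes,
                        cudaMemcpyHostToDevice, streams[i]);
    for (int i = 0; i < launches; i++)
        kernel<<<blocks, threads, 0, streams[i]>>>(g_in + i * chunk,
                                                   g_out + i * chunk);
    for (int i = 0; i < launches; i++)
        cudaMemcpyAsync(out + i * chunk, g_out + i * chunk, chunk_bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("approach 1 (streams):    %f ms\n", ms);

    // Approach 2: one synchronous copy in, one big launch, one copy out.
    cudaEventRecord(start, 0);
    cudaMemcpy(g_in, in, total_bytes, cudaMemcpyHostToDevice);
    kernel<<<blocks * launches, threads>>>(g_in, g_out);
    cudaMemcpy(out, g_out, total_bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("approach 2 (one launch): %f ms\n", ms);

    // Release everything that was allocated or created above.
    for (int i = 0; i < launches; i++) cudaStreamDestroy(streams[i]);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(in);
    cudaFreeHost(out);
    cudaFree(g_in);
    cudaFree(g_out);
    return 0;
}
```

Approach 3 would be the same loops as approach 1 with cudaMemcpy in place of cudaMemcpyAsync and no stream argument on the launch. Note that the offsets here are counted in elements while the copy sizes are in bytes, a distinction the pseudocode above glosses over by using memory_size for both.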