I have a simple “latency test” that just launches a tiny kernel many times. I have two versions:

a.) Synchronize the command stream after each call.

b.) Synchronize the whole thing (thousands of calls) once at the end.
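To make the two versions concrete, here is a rough sketch of what I mean, not my exact code. The empty kernel `tinyKernel`, the helper `timeLaunches`, the launch count, and the use of `cudaDeviceSynchronize()` for the sync are all stand-ins chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the "tiny kernel"; it does no work.
__global__ void tinyKernel() {}

// Launches tinyKernel n times and returns the elapsed time in milliseconds.
// syncEachCall == true  -> version (a): synchronize after every launch.
// syncEachCall == false -> version (b): synchronize once at the end.
static float timeLaunches(int n, bool syncEachCall)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < n; ++i) {
        tinyKernel<<<1, 1>>>();
        if (syncEachCall) {
            cudaDeviceSynchronize();  // version (a)
        }
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // version (b): the single wait at the end

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 100000;             // assumed launch count, for illustration only

    cudaFree(0);                      // create the context before timing anything

    float msA = timeLaunches(n, true);
    float msB = timeLaunches(n, false);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("(a) sync after each call: %.1f ms\n", msA);
    printf("(b) sync once at the end: %.1f ms\n", msB);
    return 0;
}
```

The timings I quote are for the whole process, so they also include context creation and teardown, which this sketch leaves out of the measured region.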

Case (a) takes 11 seconds for one run of the entire process (the executable).

So, for 2 BACK-TO-BACK runs, case (a) would take 23 seconds or so.

Now… I launch 2 instances of the process SIMULTANEOUSLY, and it takes 55 seconds for both to finish! Why is that? I expected it to take at most about 23 seconds (as in the back-to-back case).

This is on a Fermi card with CUDA 5.0.