Why are 2 parallel processes slower than 1 + 1?

I have a simple “latency test” that just launches a tiny kernel many times. There are 2 versions:
a.) Synchronize the command stream after each call.
b.) Synchronize the whole thing (thousands of calls) once at the end.
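
For anyone who wants to reproduce this, here is a rough sketch of what the two variants might look like. It is not the original test code: the kernel `tinyKernel`, the helper `timeLaunches`, and the launch count are all made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tiny kernel; the kernel used in the original test is not shown.
__global__ void tinyKernel(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *out = 1;
}

// Hypothetical helper: launches the kernel n times on the given stream and
// returns the elapsed time in milliseconds.
// syncEachLaunch == true  -> variant (a): synchronize after every launch.
// syncEachLaunch == false -> variant (b): synchronize once at the end.
float timeLaunches(cudaStream_t stream, int *d_out, int n, bool syncEachLaunch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    for (int i = 0; i < n; ++i) {
        tinyKernel<<<1, 1, 0, stream>>>(d_out);
        if (syncEachLaunch) cudaStreamSynchronize(stream);
    }
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);  // waits for all queued work in both variants

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int kLaunches = 10000;  // assumed; the post only says "thousands of calls"
    int *d_out = 0;
    cudaMalloc((void **)&d_out, sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float msA = timeLaunches(stream, d_out, kLaunches, true);
    float msB = timeLaunches(stream, d_out, kLaunches, false);
    printf("(a) sync after each launch: %.1f ms\n", msA);
    printf("(b) sync once at the end:   %.1f ms\n", msB);

    cudaStreamDestroy(stream);
    cudaFree(d_out);
    return 0;
}
```

Running two copies of the resulting executable at the same time, versus one after the other, should be enough to compare the simultaneous and back-to-back cases.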

Case (a) takes 11 seconds for 1 run of the entire process (executable).
So 2 BACK-TO-BACK runs of case (a) would take 23 seconds or so.

Now… I launch 2 runs of the process SIMULTANEOUSLY, and they take 55 seconds to finish! Why is that?! I thought it would take at most 23 seconds (as in the back-to-back case).

This is on a Fermi card with CUDA 5.0.

Current GPUs are not very efficient at context switching between processes. Launching lots of tiny kernels from two simultaneous processes is probably the worst possible case, as you’ve found. Going by your numbers (55 seconds simultaneously versus about 23 back-to-back), the GPU driver spent more than half of its time switching contexts rather than running your kernels.

Hmm… I see. And when I don’t synchronize the stream frequently, presumably several jobs from one process get batched together, so the context doesn’t have to be switched as often.
That explains the performance difference for case (b).

Very good,
Thank you seibert.