Why are 2 parallel processes slower than 1 + 1?

RezaRob3 · July 2, 2013, 3:37pm

I have a simple “latency test” that just launches a tiny kernel many times. There are two versions:
a.) Synchronize the command stream after each call.
b.) Synchronize the whole thing (thousands of calls) once at the end.
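
For readers who want to reproduce this, here is a minimal sketch of what such a latency test might look like. It is not the original poster's code: the kernel `tiny_kernel`, the helper `time_launches`, and the launch count of 10000 are all hypothetical, and the use of `std::chrono` assumes a toolkit newer than CUDA 5.0 (one whose `nvcc` accepts C++11 host code).

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tiny kernel: almost no work, so launch overhead dominates.
__global__ void tiny_kernel(int *counter) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *counter += 1;
    }
}

// Launch the kernel n times on the default stream and return the
// elapsed wall-clock time in seconds.
// sync_each_call == true  -> version (a): synchronize after every launch.
// sync_each_call == false -> version (b): synchronize once at the end.
double time_launches(int *d_counter, int n, bool sync_each_call) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        tiny_kernel<<<1, 1>>>(d_counter);
        if (sync_each_call) {
            cudaDeviceSynchronize();
        }
    }
    // Final synchronize so that all queued launches are included in the timing.
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    const int n = 10000;  // assumed launch count, chosen for illustration only
    int *d_counter = nullptr;
    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_counter, 0, sizeof(int));

    double seconds_a = time_launches(d_counter, n, true);
    double seconds_b = time_launches(d_counter, n, false);

    int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("(a) sync per call: %.3f s\n", seconds_a);
    std::printf("(b) sync at end:   %.3f s\n", seconds_b);
    std::printf("kernel executions counted: %d\n", h_counter);

    cudaFree(d_counter);
    return 0;
}
```

Running two copies of the resulting executable at the same time, versus one after the other, would be the way to compare against the timings reported in this thread; the absolute numbers will depend on the GPU and driver.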

Case (a) takes 11 seconds for one run of the entire process (executable).
So two BACK-TO-BACK runs of case (a) would take about 23 seconds.

Now… I launch 2 runs of the process SIMULTANEOUSLY, and together they take 55 seconds to finish! Why is that?! I expected at most 23 seconds, as in the back-to-back case.

This is on a Fermi card with CUDA 5.0.

seibert · July 2, 2013, 6:08pm

Current GPUs are not very efficient at context switching between processes. Launching lots of tiny kernels from two simultaneous processes is probably the worst possible case, as you’ve found. The GPU driver spent half of its time switching contexts rather than running your kernels.

RezaRob3 · July 2, 2013, 8:09pm

Hmm… I see. And when I don’t synchronize the stream frequently, presumably several jobs from one process get batched together, so the context doesn’t have to be switched as often.
That explains the performance difference for case (b).

Very good,
Thank you seibert.
