Multiple Streams Performance

Having some trouble understanding what the issue is when I use multiple streams. I'm sharing the same context between N threads. Each thread receives data over an ethernet socket, pipes it to the GPU for processing, gets the results, and sends them back out on the wire. The threads are completely independent of each other and each has its own stream to the GPU. When I benchmark, performance gets severely worse as I scale up from a single thread.
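Roughly, each thread does something like this per buffer (a simplified sketch, not my actual code; process_kernel, the buffer size, and the pinned/device buffers are placeholders, and the context locking I describe further down is left out):

```
#include <cuda_runtime.h>

#define BUF_BYTES (1 << 20)   /* placeholder buffer size */

/* stand-in for the real processing kernel */
__global__ void process_kernel(const char *in, char *out);

/* One of these loops runs per thread; the stream and the pinned/device
 * buffers are created once by that thread at startup. */
void worker_loop(cudaStream_t stream, char *h_in, char *h_out,
                 char *d_in, char *d_out)
{
    for (;;) {
        /* 1. read a buffer from the socket into pinned host memory
              (socket code omitted) */

        /* 2. push it to the GPU on this thread's stream */
        cudaMemcpyAsync(d_in, h_in, BUF_BYTES, cudaMemcpyHostToDevice, stream);

        /* 3. process it */
        process_kernel<<<256, 256, 0, stream>>>(d_in, d_out);

        /* 4. pull the results back on the same stream */
        cudaMemcpyAsync(h_out, d_out, BUF_BYTES, cudaMemcpyDeviceToHost, stream);

        /* 5. wait for the stream to drain, then send the results on the wire */
        cudaStreamSynchronize(stream);
    }
}
```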

For example, a single thread backs up on the GPU at about 300 MB/sec, but 10 threads back up at about 170 MB/sec (totals across all threads). I don't understand why the performance should be any different, e.g. 1 thread at 300 or 10 threads at 30 each for 300. I know there could be some slight overhead losses and mismatches, but only getting a bit over 50% of the single-stream throughput?

Does blocking on stream completion cause a busy CPU wait? I.e. could I be burning lots of cycles in threads that are just waiting? The GPU profile looks identical; with multiple streams it just hops from one stream to another. Speaking of which, what algorithm does it use to decide that? The stream that has been waiting longest? Round robin?
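On the busy-wait question: I know the driver API takes a scheduling hint when the context is created, but I don't know which behaviour I'm actually getting while a thread blocks on a stream. This is the knob I mean (sketch only; the flag value here is just illustrative, not what I'm running):

```
#include <cuda.h>

/* The flag passed to cuCtxCreate() controls what the CPU does while a
 * thread waits on the GPU: CU_CTX_SCHED_SPIN busy-waits,
 * CU_CTX_SCHED_YIELD yields the core, CU_CTX_BLOCKING_SYNC sleeps on an
 * OS primitive, and CU_CTX_SCHED_AUTO (the default) lets the driver pick. */
CUcontext create_context(void)
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_BLOCKING_SYNC, dev);  /* illustrative choice */
    return ctx;
}
```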

I'm using multiple streams in anticipation of moving to a Fermi card where I can overlap some kernels; otherwise it's really not helping me at the moment, so I could create my own queue in code and probably fix it that way, but that doesn't help long term. The single context is mapped to a Tesla M1060.

Any ideas would be helpful, thanks!

Is your CPU pegged? Maybe with multiple threads you're making a lot of cudaStreamSynchronize() calls, and if you have more threads than cores, that would start giving you scheduling stalls.
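If it does turn out to be spinning, one thing you could try is polling the stream and yielding instead of blocking in the synchronize call. Untested sketch:

```
#include <cuda_runtime.h>
#include <unistd.h>

/* Untested idea: rather than parking the thread inside
 * cudaStreamSynchronize(), poll the stream and give the core back
 * between polls, so ten waiting threads aren't fighting the scheduler. */
static void wait_for_stream(cudaStream_t stream)
{
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
        usleep(100);   /* poll interval is arbitrary */
}
```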

What OS?

This is on RHEL 5.5 with CUDA 3.2 installed, though I've seen this since I started on 3.0. It's a dual quad-core Xeon (with HT), so 16 processors show up.

Along these lines, is there any way to call cuStreamSynchronize() without blocking all threads? E.g. one context shared by N threads, each with its own stream. You have to provide a critical region where you push/pop the context to ensure only one thread has the context current at a time. However, if that thread has to wait on its stream to finish before continuing, it effectively blocks all the others, since it can't pop the context and unlock until the synchronize call returns.

I think this is the underlying problem, but I don't really know a way to solve it, short of a context for every thread, and that is counterproductive: it eliminates the gains on Fermi, the sharing of common data, etc. What am I missing?
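To be concrete, the pattern I'm describing looks roughly like this (simplified; gSharedCtx, gCtxLock, and run_and_wait are placeholders for however this lives in my real code):

```
#include <cuda.h>
#include <pthread.h>

/* Simplified version of the pattern described above: one shared context
 * and one mutex guarding it. */
extern CUcontext       gSharedCtx;
extern pthread_mutex_t gCtxLock;

void run_and_wait(CUstream stream)
{
    pthread_mutex_lock(&gCtxLock);
    cuCtxPushCurrent(gSharedCtx);

    /* ... queue async copies / kernel launches on this thread's stream ... */

    /* Problem: this keeps the context (and the lock) away from every
     * other thread until this stream drains. */
    cuStreamSynchronize(stream);

    cuCtxPopCurrent(NULL);
    pthread_mutex_unlock(&gCtxLock);
}
```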
