Multiple Streams Performance

Having some trouble understanding what the issue is when I use multiple streams. I'm sharing the same context between N threads. Each thread receives data over an ethernet socket, pipes it to the GPU for processing, gets the results, and sends them back out on the wire. The threads are completely independent of each other and each has its own stream to the GPU. When I benchmark, performance gets severely worse as I scale up from a single thread.
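Roughly, each thread does something like this per buffer (a simplified sketch, not my actual code; process_kernel, the buffer size, and the pinned/device buffers are placeholders, and the context locking I describe further down is left out):

```
#include <cuda_runtime.h>

#define BUF_BYTES (1 << 20)   /* placeholder buffer size */

/* stand-in for the real processing kernel */
__global__ void process_kernel(const char *in, char *out);

/* One of these loops runs per thread; the stream and the pinned/device
 * buffers are created once by that thread at startup. */
void worker_loop(cudaStream_t stream, char *h_in, char *h_out,
                 char *d_in, char *d_out)
{
    for (;;) {
        /* 1. read a buffer from the socket into pinned host memory
              (socket code omitted) */

        /* 2. push it to the GPU on this thread's stream */
        cudaMemcpyAsync(d_in, h_in, BUF_BYTES, cudaMemcpyHostToDevice, stream);

        /* 3. process it */
        process_kernel<<<256, 256, 0, stream>>>(d_in, d_out);

        /* 4. pull the results back on the same stream */
        cudaMemcpyAsync(h_out, d_out, BUF_BYTES, cudaMemcpyDeviceToHost, stream);

        /* 5. wait for the stream to drain, then send the results on the wire */
        cudaStreamSynchronize(stream);
    }
}
```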

For example, a single thread backs up on the GPU at about 300 MB/sec, but 10 threads back up at about 170 MB/sec (totals across all threads). I don't understand why the performance should be any different, e.g. 1 thread at 300 or 10 threads at 30 each for 300. I know there could be some slight overhead losses and mismatches, but only getting a bit over 50% of the single-stream throughput?

Does blocking on stream completion cause a busy CPU wait? I.e. could I be burning lots of cycles in threads that are just waiting? The GPU profile looks identical; with multiple streams it just hops from one stream to another. Speaking of which, what algorithm does it use to decide that? The stream that has been waiting longest? Round robin?
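On the busy-wait question: I know the driver API takes a scheduling hint when the context is created, but I don't know which behaviour I'm actually getting while a thread blocks on a stream. This is the knob I mean (sketch only; the flag value here is just illustrative, not what I'm running):

```
#include <cuda.h>

/* The flag passed to cuCtxCreate() controls what the CPU does while a
 * thread waits on the GPU: CU_CTX_SCHED_SPIN busy-waits,
 * CU_CTX_SCHED_YIELD yields the core, CU_CTX_BLOCKING_SYNC sleeps on an
 * OS primitive, and CU_CTX_SCHED_AUTO (the default) lets the driver pick. */
CUcontext create_context(void)
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_BLOCKING_SYNC, dev);  /* illustrative choice */
    return ctx;
}
```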

I'm using multiple streams in anticipation of moving to a Fermi card where I can overlap some kernels; otherwise it's really not helping me at the moment, so I could create my own queue in code and probably fix it that way, but that doesn't help long term. The single context is mapped to a Tesla M1060.

Any ideas would be helpful, thanks!

Is your CPU pegged? Maybe with multiple threads you're making a lot of cudaStreamSynchronize() calls, and if you have more threads than cores, that would start giving you scheduling stalls.
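If it does turn out to be spinning, one thing you could try is polling the stream and yielding instead of blocking in the synchronize call. Untested sketch:

```
#include <cuda_runtime.h>
#include <unistd.h>

/* Untested idea: rather than parking the thread inside
 * cudaStreamSynchronize(), poll the stream and give the core back
 * between polls, so ten waiting threads aren't fighting the scheduler. */
static void wait_for_stream(cudaStream_t stream)
{
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
        usleep(100);   /* poll interval is arbitrary */
}
```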

What OS?

This is on RHEL 5.5 with CUDA 3.2 installed, though I've seen this since I started on 3.0. It's a dual quad-core Xeon (with HT), so 16 processors show up.

Along these lines, is there any way to call cuStreamSynchronize() without blocking all threads? E.g. one context shared by N threads, each with its own stream. You have to provide a critical region where you push/pop the context to ensure only one thread has the context current at a time. However, if that thread has to wait on its stream to finish before continuing, it effectively blocks all the others, since it can't pop the context and unlock until the synchronize call returns.

I think this is the underlying problem, but I don't really know a way to solve it, short of a context for every thread, and that is counterproductive: it eliminates the gains on Fermi, the sharing of common data, etc. What am I missing?
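To be concrete, the pattern I'm describing looks roughly like this (simplified; gSharedCtx, gCtxLock, and run_and_wait are placeholders for however this lives in my real code):

```
#include <cuda.h>
#include <pthread.h>

/* Simplified version of the pattern described above: one shared context
 * and one mutex guarding it. */
extern CUcontext       gSharedCtx;
extern pthread_mutex_t gCtxLock;

void run_and_wait(CUstream stream)
{
    pthread_mutex_lock(&gCtxLock);
    cuCtxPushCurrent(gSharedCtx);

    /* ... queue async copies / kernel launches on this thread's stream ... */

    /* Problem: this keeps the context (and the lock) away from every
     * other thread until this stream drains. */
    cuStreamSynchronize(stream);

    cuCtxPopCurrent(NULL);
    pthread_mutex_unlock(&gCtxLock);
}
```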
