Having some trouble understanding what the issue is when I use multiple streams. I'm sharing the same context between N threads. Each thread receives data over an Ethernet socket, pipes it to the GPU for processing, gets the results, and sends them back out on the wire. The threads are completely independent of each other and each has its own stream to the GPU. When I benchmark performance, I get severely worse throughput as I scale up from a single thread.
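For reference, here's roughly what each thread's loop looks like (a simplified sketch, not my actual code; the kernel name, launch configuration, and socket helpers are placeholders):

```cuda
// Per-thread worker loop (sketch). All threads share the one context;
// each thread created its own stream up front. h_buf is pinned host
// memory so cudaMemcpyAsync can actually overlap.
void worker(cudaStream_t stream, char *h_buf, char *d_buf, size_t len)
{
    dim3 grid(64), block(256);  // placeholder launch config

    for (;;) {
        recv_from_socket(h_buf, len);            // blocking socket read
        cudaMemcpyAsync(d_buf, h_buf, len,
                        cudaMemcpyHostToDevice, stream);
        process<<<grid, block, 0, stream>>>(d_buf, len);
        cudaMemcpyAsync(h_buf, d_buf, len,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);           // wait for this stream only
        send_to_socket(h_buf, len);              // results back on the wire
    }
}
```

Each thread only ever synchronizes on its own stream, never on the whole device.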
For example, a single thread was backing up on the GPU at about 300 MB/sec, but 10 threads back up at about 170 MB/sec (totals across all threads). I don't understand why the performance should differ: I'd expect either 1 thread at 300, or 10 threads at ~30 each, for the same 300 total. I know there could be some slight overhead losses and mismatches, but getting only slightly over 50% of single-stream performance?
Does blocking on stream completion cause a busy CPU wait? That is, could my idle threads be burning lots of cycles while they wait? Profiling shows the GPU behaving identically either way; with multiple streams it just hops from one to another. Speaking of which, what algorithm determines which stream runs next? Longest-waiting stream? Round robin?
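If the answer is that synchronization spin-waits by default, one thing I'm considering trying (my understanding of the flag, not verified on this setup) is asking the runtime to block instead of spin:

```cuda
// Must be called per host thread before the context is attached,
// i.e. before any other CUDA runtime call in that thread.
cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
}
// With this set, cudaStreamSynchronize() should sleep the calling
// thread on a sync primitive rather than busy-poll the device,
// at the cost of some wakeup latency.
```

If 10 threads each spin-polling the same context are fighting over a lock inside the driver, that could plausibly explain the throughput collapse, but I'd like confirmation.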
I'm using multiple streams in anticipation of moving to a Fermi card, where I can overlap some kernels; otherwise they're really not helping me at the moment, so I could build my own work queue in host code and probably work around it that way. But that doesn't help long term. The single context is mapped to a Tesla M1060.
Any ideas would be helpful, thanks!