How many streams should I use for concurrent kernels?

I’ve tried to optimize my code without any great success, so I thought I would try to use concurrent kernels since I have a Fermi card. I’ve read about it in the programming guide but I still have some questions.

Do I have to use more than one stream to get concurrent kernel execution? How do I know how many streams to use? By trial and error?

If I use two streams, I guess that this requires two times the memory?

In my case I have a few very lightweight kernels that take 0.5 to 10 ms each, but I want to run them about 100,000 times. Each kernel works on 176 thread blocks. My guess is that some of the multiprocessors go idle very quickly and wait for the next kernel to launch. With concurrent kernel execution, might they stay occupied all the time? The kernels have to be launched in a specific order.

for (int i = 0; i < 100000; i++)
{
    // Copy constants from host memory to constant memory (depends on i; 320 bytes are copied)
    // Launch kernel A
    // Launch kernel B
    // Launch kernel C
    // Launch kernel D
    // Copy result to host (4 bytes are copied)
}

Would this be a good fit for concurrent kernel execution?

The absolute maximum number of streams you need for max concurrency on GF100 is 18: one per copy engine (two on Tesla), plus 16 for concurrent kernels.

Just a random data point…
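
For illustration, here is a minimal sketch of how that upper bound could be derived at run time. It assumes the Fermi-era limit of 16 concurrent kernels quoted above (that number is hard-coded, not queried), and the helper names usefulStreamCount and createStreams are hypothetical, not part of the CUDA API. The asyncEngineCount and concurrentKernels fields come from cudaDeviceProp.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Upper bound on useful streams, following the reasoning in the post above:
// one stream per copy engine, plus up to 16 streams for concurrent kernels.
// ASSUMPTION: 16 is the Fermi-era limit quoted in the thread; it is hard-coded here.
int usefulStreamCount(int asyncEngineCount, bool concurrentKernels)
{
    const int kMaxConcurrentKernels = 16;
    const int kernelStreams = concurrentKernels ? kMaxConcurrentKernels : 1;
    return asyncEngineCount + kernelStreams;
}

// Hypothetical helper: creates that many streams on the given device.
// Returns an empty vector if any CUDA call fails.
std::vector<cudaStream_t> createStreams(int device)
{
    cudaDeviceProp prop;
    if (cudaSetDevice(device) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return {};
    }

    const int n = usefulStreamCount(prop.asyncEngineCount,
                                    prop.concurrentKernels != 0);
    std::vector<cudaStream_t> streams;
    for (int i = 0; i < n; ++i) {
        cudaStream_t s;
        if (cudaStreamCreate(&s) != cudaSuccess) {
            // Clean up the streams created so far before giving up.
            for (cudaStream_t created : streams) cudaStreamDestroy(created);
            return {};
        }
        streams.push_back(s);
    }
    return streams;
}
```

On a device reporting two copy engines and concurrent kernel support, usefulStreamCount(2, true) gives the 18 mentioned above. This is only a ceiling; whether that many streams actually helps still has to be measured. Each stream should be released with cudaStreamDestroy when it is no longer needed.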

Yeah, but they need to work on separate memory spaces, right? So if one stream uses 100 MB, 18 streams will use 1.8 GB?

Not necessarily. Is it possible to have terrible race conditions between kernels if they're all reading and writing the same space? Sure, so don't do that. But if you have 16 kernels reading from one 100 MB buffer, each doing something with a different set of parameters and writing 1 MB of output? That's 100 MB + 16 × 1 MB = 116 MB, not 1600 MB.
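
The shared-input layout described above could look roughly like the sketch below. Everything named here (scaleKernel, totalBytes, launchAll, and the per-stream parameter 1.0f + s) is made up for illustration; the real kernels and parameters would be whatever the application needs. The point is only that every launch reads the same read-only input pointer while writing to its own slice of the output buffer.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: reads the shared input, applies a per-launch parameter,
// and writes to an output slice owned by this launch alone.
// For brevity it only reads the first n elements of the input.
__global__ void scaleKernel(const float* in, float* out, size_t n, float param)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * param;
}

// Device memory needed: one shared input plus one output per stream.
size_t totalBytes(size_t inputBytes, size_t outputBytesPerStream, int numStreams)
{
    return inputBytes + outputBytesPerStream * (size_t)numStreams;
}

// Launches one kernel per stream. All launches read d_in (never written),
// and stream s writes only to d_out[s * outElems .. (s + 1) * outElems),
// so there is no race between the concurrent kernels.
cudaError_t launchAll(const float* d_in, float* d_out, size_t outElems,
                      cudaStream_t* streams, int numStreams)
{
    const int threads = 256;
    const int blocks = (int)((outElems + threads - 1) / threads);
    for (int s = 0; s < numStreams; ++s) {
        scaleKernel<<<blocks, threads, 0, streams[s]>>>(
            d_in, d_out + (size_t)s * outElems, outElems, 1.0f + s);
    }
    return cudaGetLastError();
}
```

With a 100 MB input, 1 MB outputs, and 16 streams, totalBytes returns 116 MB, matching the arithmetic above. d_out would be allocated once with cudaMalloc at numStreams * outElems * sizeof(float) bytes.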
