How to implement a calculation pipeline with streams?

I would appreciate general advice, or a suggested scheme, for a better implementation of the following task:

  1. All the source data is copied to the GPU before any kernel call, as the data is not too big.
  2. Each kernel call processes a portion of the source data and stores its results in GPU memory, which are subsequently copied to main system RAM.
  3. Kernels are called one by one: as soon as the previous kernel finishes, the next one is launched.
  4. The CPU waits in a separate thread for each kernel to complete; as soon as a kernel finishes, the CPU processes the data it produced and then waits for the next kernel.
  5. Streams seem like a suitable solution, but it is not clear how to use them inside a loop {run kernel - acquire results - run kernel …}: are stream objects reusable, or do they have to be recreated before each kernel call, etc.? (A sketch of what I have in mind follows this list.)
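
Below is a minimal sketch of the loop I have in mind, assuming a single stream is created once and reused for every iteration, with one event recorded per chunk so the host can tell when that chunk's result has arrived. The names `processChunk`, `N_CHUNKS` and `CHUNK_SIZE` are placeholders, not anything from a real API:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: dummy per-element work on one chunk of the source.
__global__ void processChunk(const float* src, float* dst, int n, int chunk)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = 2.0f * src[(size_t)chunk * n + i];   // stand-in computation
}

int main()
{
    const int N_CHUNKS   = 16;        // placeholder: number of kernel calls
    const int CHUNK_SIZE = 1 << 20;   // placeholder: elements per chunk

    float *d_src, *d_dst, *h_result;
    cudaMalloc(&d_src, (size_t)N_CHUNKS * CHUNK_SIZE * sizeof(float));
    cudaMalloc(&d_dst, CHUNK_SIZE * sizeof(float));
    // Pinned host memory so cudaMemcpyAsync can run truly asynchronously.
    cudaMallocHost(&h_result, (size_t)N_CHUNKS * CHUNK_SIZE * sizeof(float));

    // ... copy ALL source data to d_src once, before the loop (step 1) ...

    // One stream, created once and reused in every iteration.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t done[N_CHUNKS];
    for (int i = 0; i < N_CHUNKS; ++i) cudaEventCreate(&done[i]);

    const int threads = 256;
    const int blocks  = (CHUNK_SIZE + threads - 1) / threads;

    for (int chunk = 0; chunk < N_CHUNKS; ++chunk) {
        // Kernel, copy and event are queued in the same stream, so the copy
        // of chunk i always precedes the kernel of chunk i+1 on the device.
        processChunk<<<blocks, threads, 0, stream>>>(d_src, d_dst, CHUNK_SIZE, chunk);
        cudaMemcpyAsync(h_result + (size_t)chunk * CHUNK_SIZE, d_dst,
                        CHUNK_SIZE * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(done[chunk], stream);
        // All three calls return immediately; the queued work keeps the GPU busy.
    }

    cudaStreamSynchronize(stream);   // or let a worker thread consume the events

    for (int i = 0; i < N_CHUNKS; ++i) cudaEventDestroy(done[i]);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_result);
    cudaFree(d_dst);
    cudaFree(d_src);
    return 0;
}
```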

Also, maybe the SDK contains a sample that does something similar? The main idea is to keep the GPU working continuously, without delays caused by CPU post-processing: the GPU runs without pause, while the CPU handles the results in a separate thread as soon as they are produced.

If the GPU works faster than the CPU, the GPU results are queued. In no case must the CPU become a bottleneck for the GPU. (A sketch of the consumer thread I imagine follows below.)
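
To keep the CPU out of the GPU's way, I imagine a consumer thread along these lines. It reuses the hypothetical `done` events and `h_result` buffer from the sketch above; results are "queued" simply because each chunk gets its own slot in pinned host memory:

```cpp
#include <cuda_runtime.h>
#include <thread>

// Hypothetical consumer thread: waits for each chunk's event, then
// post-processes that chunk while the GPU keeps working on later chunks.
void consumer(const cudaEvent_t* done, const float* h_result,
              int nChunks, int chunkSize)
{
    for (int chunk = 0; chunk < nChunks; ++chunk) {
        // Blocks only this thread until chunk's device-to-host copy is done.
        cudaEventSynchronize(done[chunk]);
        // ... CPU post-processing of h_result + (size_t)chunk * chunkSize ...
    }
}

// Intended use in main(), after the loop that enqueues all kernels/copies
// (so every event has already been recorded into the stream):
//
//   std::thread worker(consumer, done, h_result, N_CHUNKS, CHUNK_SIZE);
//   worker.join();   // replaces cudaStreamSynchronize(stream)
```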

Thanks in advance!