I’d appreciate general advice on, or a scheme for, a better implementation of the following task:
- All the source data goes to the GPU before any kernel call, as the data is not too big.
- Each kernel call evaluates a portion of the source data and stores the results in GPU memory, which are subsequently copied to main system RAM.
- Kernels are called one by one: as soon as the previous kernel finishes, the next kernel is launched.
- The CPU waits in a separate thread for each kernel to complete and processes the data that kernel produced; as soon as a kernel finishes, the CPU handles its data and then waits for the next kernel.
- Streams seem like a suitable solution, but it is not clear how to use them inside a loop {run kernel - acquire results - run kernel ...}: are stream objects reusable, or must they be recreated before each kernel call, etc.?
Also, maybe the SDK contains a sample that does something similar? The main idea is to keep the GPU working continuously, without delays caused by CPU post-processing: the GPU runs without pause, and the CPU handles the results in a separate thread as soon as they are produced.
If the GPU works faster than the CPU, the GPU results are queued. The CPU must not become a bottleneck for the GPU in any case.
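To make the question concrete, here is a rough sketch of the loop I have in mind. Everything in it is an assumption on my part: the `process()` kernel is a hypothetical placeholder, the chunk sizes are arbitrary, and I am not sure this is the intended way to use a single reused stream with per-chunk events and pinned host buffers.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical placeholder kernel standing in for the real evaluation.
__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main(void) {
    const int N_CHUNKS = 8, CHUNK = 1 << 20;   // arbitrary sizes for illustration
    float *d_in, *d_out, *h_out[N_CHUNKS];
    cudaMalloc(&d_in,  (size_t)N_CHUNKS * CHUNK * sizeof(float));
    cudaMalloc(&d_out, (size_t)N_CHUNKS * CHUNK * sizeof(float));
    // ... copy all source data to d_in up front, as described above ...

    cudaStream_t stream;
    cudaStreamCreate(&stream);                 // created once, reused for every launch
    cudaEvent_t done[N_CHUNKS];
    for (int c = 0; c < N_CHUNKS; ++c) {
        cudaEventCreate(&done[c]);
        // Pinned host memory so the device-to-host copies can be asynchronous.
        cudaHostAlloc(&h_out[c], CHUNK * sizeof(float), cudaHostAllocDefault);
    }

    // Enqueue all work; launches are asynchronous, so this loop returns
    // quickly and the GPU proceeds without waiting on the CPU.
    for (int c = 0; c < N_CHUNKS; ++c) {
        process<<<(CHUNK + 255) / 256, 256, 0, stream>>>(
            d_in + (size_t)c * CHUNK, d_out + (size_t)c * CHUNK, CHUNK);
        cudaMemcpyAsync(h_out[c], d_out + (size_t)c * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(done[c], stream);      // signals: chunk c is now in h_out[c]
    }

    // Consumer side (this part would live on the separate CPU thread):
    // wait for each chunk's event, then post-process while the GPU runs on.
    for (int c = 0; c < N_CHUNKS; ++c) {
        cudaEventSynchronize(done[c]);
        printf("chunk %d ready, first value %f\n", c, h_out[c][0]);
    }

    for (int c = 0; c < N_CHUNKS; ++c) {
        cudaEventDestroy(done[c]);
        cudaFreeHost(h_out[c]);
    }
    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Is this roughly the right structure, or should the results queue be handled differently when the CPU falls behind?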
Thanks in advance!