The SDK includes a sample (concurrentKernels) that shows how to launch a batch of kernels, each in its own stream, and then wait until every stream has signaled completion. However, the sample does not show how to keep relaunching kernels as soon as any kernel from the initial batch finishes.
For example, if I run 16 kernels concurrently, I'd like to track all of them, and when, say, stream 3 finishes, launch one more kernel in stream 3 rather than wait for all 16 streams to finish and then rerun the whole batch. This approach can be useful when, for example, 1 million abstract data items have to be processed by launching one kernel per item.
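To make the idea concrete, here is a minimal sketch of one possible approach, not a confirmed solution: the host polls each stream with `cudaStreamQuery` and launches the next item into whichever stream reports that it is idle. The kernel `processItem`, the item count `kItems`, and the doubling "work" are placeholders I made up for illustration. Only the CUDA runtime calls are real API.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-item kernel: stands in for the real work on one data item.
__global__ void processItem(const int *in, int *out, int item)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        out[item] = in[item] * 2;
}

int main()
{
    const int kStreams = 16;   // number of concurrent slots, as in the example above
    const int kItems   = 1000; // assumed item count (the post mentions 1 million)

    // Host input: item i simply holds the value i.
    std::vector<int> hIn(kItems), hOut(kItems, 0);
    for (int i = 0; i < kItems; ++i) hIn[i] = i;

    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn,  kItems * sizeof(int));
    cudaMalloc(&dOut, kItems * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), kItems * sizeof(int), cudaMemcpyHostToDevice);

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    int next = 0; // index of the next unprocessed item

    // Prime every stream with one kernel.
    for (int s = 0; s < kStreams && next < kItems; ++s, ++next)
        processItem<<<1, 32, 0, streams[s]>>>(dIn, dOut, next);

    // Poll: cudaStreamQuery returns cudaSuccess once all work queued in the
    // stream has finished, and cudaErrorNotReady while it is still running.
    while (next < kItems) {
        for (int s = 0; s < kStreams && next < kItems; ++s) {
            if (cudaStreamQuery(streams[s]) == cudaSuccess) {
                processItem<<<1, 32, 0, streams[s]>>>(dIn, dOut, next);
                ++next;
            }
        }
    }

    // Wait for the kernels that are still in flight.
    cudaDeviceSynchronize();

    cudaMemcpy(hOut.data(), dOut, kItems * sizeof(int), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < kItems; ++i)
        if (hOut[i] != 2 * hIn[i]) ++mismatches;
    printf("items processed: %d, mismatches: %d\n", kItems, mismatches);

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The polling loop keeps a CPU core busy, and error checking is left out for brevity. Since kernels queued in the same stream already run in order, simply enqueuing items round-robin across the 16 streams without polling may give a similar effect; the polling version only matters if per-item run times differ a lot.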
This is easy to implement on a CPU: create 16 worker threads, have each thread take the next data item from an input queue, and let all worker threads exit once the queue is empty. How can I do the same thing on a Fermi-based GPU using its concurrent kernel execution feature?
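For reference, the CPU version described above could look roughly like the host-only sketch below. It assumes the "queue" is just a vector of items plus an atomic index that hands out the next item. `handleItem` and `processAll` are hypothetical names, and the doubling is placeholder work.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-item work; a stand-in for the real processing.
int handleItem(int item) { return item * 2; }

// Processes every item with a fixed pool of worker threads.
// Each worker repeatedly claims the next unprocessed index and
// exits as soon as no items are left.
std::vector<int> processAll(const std::vector<int>& items, unsigned workers = 16)
{
    std::vector<int> results(items.size());
    std::atomic<std::size_t> next{0}; // shared cursor acting as the input queue

    auto worker = [&]() {
        for (;;) {
            std::size_t i = next.fetch_add(1); // claim one item atomically
            if (i >= items.size()) return;     // queue is empty: worker finishes
            results[i] = handleItem(items[i]); // each index is written by one thread only
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join(); // wait until all workers have finished
    return results;
}
```

A call such as `processAll(items, 16)` returns once every item has been handled, which is the behavior I would like to reproduce with 16 streams on the GPU.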
Thanks in advance.