Hi,

Say I have a kernel that does some simple math on one input array and produces one output array. The input and output arrays are each 5 megabytes in size and contain single-precision float values.

There are, say, 1000 input arrays. What I want is to implement a “copy input array to device” -> “run kernel on input array” -> “copy results from device” pipeline that handles all 1000 input arrays one after another.

Obviously, it would be great to overlap the memcopies and the calculations using CUDA streams. The CUDA samples show how to run a fixed number of operations on a fixed number of streams, but they do not show how to give a stream new work as soon as it becomes idle. What I have in mind is to create two CUDA streams (one for memcopies and one for kernel runs) and process all the input data with them, but I still can’t see how to do that using the CUDA samples as a basis.
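To make the question more concrete, here is a minimal sketch of the kind of loop I am imagining. Treat it as an assumption, not a known-good solution. It is not the two-stream split described above (one stream for copies, one for kernels) but a round-robin variant, in which a few streams each run the full copy/kernel/copy sequence on their own buffers. The kernel `simpleMath`, the slot count `numStreams`, the `CHECK` macro and the helper `checkSlot` are all made-up names for illustration, and "5 megabytes" is taken to mean 5 * 1024 * 1024 bytes. The host buffers are pinned with `cudaMallocHost`, because `cudaMemcpyAsync` only runs truly asynchronously on page-locked memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper: abort on any CUDA runtime error.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Hypothetical stand-in for the real "simple math" kernel.
__global__ void simpleMath(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;
}

// Placeholder for consuming a finished result; here it only sanity-checks it.
static void checkSlot(const float* out, size_t n, int arrayIdx) {
    float expected = 2.0f * (float)arrayIdx + 1.0f;
    if (out[0] != expected || out[n - 1] != expected) {
        fprintf(stderr, "wrong result for array %d\n", arrayIdx);
        exit(EXIT_FAILURE);
    }
}

int main() {
    const size_t bytes = 5u * 1024u * 1024u;   // assumed meaning of "5 megabytes"
    const size_t n = bytes / sizeof(float);
    const int numArrays = 1000;
    const int numStreams = 3;                  // assumed slot count, to be tuned

    cudaStream_t streams[numStreams];
    float *hIn[numStreams], *hOut[numStreams], *dIn[numStreams], *dOut[numStreams];
    int inFlight[numStreams];                  // array index owned by each slot, -1 if idle

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMallocHost((void**)&hIn[s], bytes));   // pinned host staging buffers
        CHECK(cudaMallocHost((void**)&hOut[s], bytes));
        CHECK(cudaMalloc((void**)&dIn[s], bytes));
        CHECK(cudaMalloc((void**)&dOut[s], bytes));
        inFlight[s] = -1;
    }

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    for (int a = 0; a < numArrays; ++a) {
        const int s = a % numStreams;

        // Wait until this slot's previous job is done before reusing its buffers.
        CHECK(cudaStreamSynchronize(streams[s]));
        if (inFlight[s] >= 0) checkSlot(hOut[s], n, inFlight[s]);

        // Stage input array 'a' (dummy data here; real code would load it).
        for (size_t i = 0; i < n; ++i) hIn[s][i] = (float)a;

        // Enqueue copy-in, kernel and copy-out; work in other streams may overlap.
        CHECK(cudaMemcpyAsync(dIn[s], hIn[s], bytes, cudaMemcpyHostToDevice, streams[s]));
        simpleMath<<<blocks, threads, 0, streams[s]>>>(dIn[s], dOut[s], n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hOut[s], dOut[s], bytes, cudaMemcpyDeviceToHost, streams[s]));
        inFlight[s] = a;
    }

    // Drain the jobs that are still in flight, then release everything.
    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamSynchronize(streams[s]));
        if (inFlight[s] >= 0) checkSlot(hOut[s], n, inFlight[s]);
        CHECK(cudaFree(dIn[s]));
        CHECK(cudaFree(dOut[s]));
        CHECK(cudaFreeHost(hIn[s]));
        CHECK(cudaFreeHost(hOut[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    printf("processed %d arrays\n", numArrays);
    return 0;
}
```

The idea is that each stream slot is "reloaded" as soon as its previous job has finished, so only `numStreams` sets of buffers are ever allocated. Whether the copies and kernels really overlap presumably depends on the device's copy engines, which is part of what I am unsure about.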

Has any of you implemented something similar? I would really appreciate any hints.

Thanks in advance.