Say, I have a kernel that does some simple math on one input array and produces one output array. Both the input and output arrays are 5 megabytes in size, and each array contains single-precision float values.
The number of input arrays is, say, 1000. What I want to do is to implement the “copy input array to device” → “run kernel on input array” → “copy results from device” pipeline that would handle all 1000 input arrays sequentially.
Obviously, it would be great to overlap the memcopies and calculations using CUDA streams. The CUDA samples show how to run a fixed number of operations on a fixed number of streams, but not how to keep feeding the streams with new work as soon as they become idle. What I have in mind is to create two CUDA streams (one for memcopies and one for kernel runs) and handle all the input data with them, but I still can’t understand how to do it using the CUDA samples as a basis.
Has anybody of you guys implemented something similar? I would really appreciate any hints.
If you have a kernel that consumes data transferred with a memcpy, then the memcpy and the kernel should be on the same stream. Separate streams can execute independently, so the kernel could start before the memcpy has finished if they are on separate streams. (A more global synchronization, such as cudaDeviceSynchronize(), could prevent this.)
I believe that something like the following should work:
#define NUM_STREAMS 16

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i) { cudaStreamCreate(&streams[i]); }

for (int i = 0; i < numArrays; ++i)
{
    // Round-robin dependent tasks across the streams.
    const int curStream = i % NUM_STREAMS;

    // The H2D copy, the kernel, and the D2H copy are issued to the same stream,
    // so they run in order relative to each other but can overlap with work
    // issued to the other streams.
    cudaMemcpyAsync(dev_arrays + i*arraySize, host_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyHostToDevice, streams[curStream]);
    YourKernel<<<grid, block, smemSize, streams[curStream]>>>(...);
    cudaMemcpyAsync(host_arrays + i*arraySize, dev_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyDeviceToHost, streams[curStream]);
}

for (int i = 0; i < NUM_STREAMS; ++i)
{
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
Error checking and some declarations are missing, but hopefully the intent is clear. You just round-robin dependent tasks into some number of streams. The host memory must be pinned (page-locked) for the async memcpy calls to actually be asynchronous. You can adjust NUM_STREAMS for the architecture: GK110 added support for many more concurrent streams (see HyperQ). I’m not sure what the value should be for Fermi, but probably just 1 or 2 (as I understand it, Fermi only has a single hardware queue).
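For reference, the missing declarations might look roughly like this (a minimal sketch; host_arrays, dev_arrays, arraySize, and numArrays are the names used above, and the sizes are just the ones from the original question):

const size_t arraySize = (5 * 1024 * 1024) / sizeof(float);  // floats per 5 MB array
const int numArrays = 1000;

float *host_arrays = NULL;
float *dev_arrays  = NULL;

// Pinned (page-locked) host memory, required for cudaMemcpyAsync to overlap with kernels.
cudaMallocHost((void**)&host_arrays, sizeof(float) * arraySize * numArrays);

// One device slot per input array, as assumed by the indexing in the loop above.
cudaMalloc((void**)&dev_arrays, sizeof(float) * arraySize * numArrays);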
Great code snippet, I understand the idea. However, if I understand right, this code assumes that both host and device memory are preallocated and that the allocated blocks are big enough to fit all of the arrays.
The number of arrays can be rather big; the overall amount of memory required to fit them all is about 5 gigabytes. Only some Teslas have that much RAM, to say nothing of the fact that it would hardly be possible to pin such a large piece of RAM on the host.
Well, successive kernels submitted on the same stream are serialized with respect to the previous kernels in that stream, so it is safe to use only NUM_STREAMS device buffers. (The code would then change to dev_arrays + curStream*arraySize above, roughly as sketched below.) That takes care of the device memory constraint: NUM_STREAMS can be adjusted to control the device memory usage.
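In code, that change might look roughly like this (same hypothetical names as before; only NUM_STREAMS device slots are allocated and each stream reuses its own slot):

// Device memory shrinks from numArrays slots to NUM_STREAMS slots.
cudaMalloc((void**)&dev_arrays, sizeof(float) * arraySize * NUM_STREAMS);

for (int i = 0; i < numArrays; ++i)
{
    const int curStream = i % NUM_STREAMS;

    // Work issued to the same stream executes in order, so iteration
    // i + NUM_STREAMS cannot touch this slot before iteration i is done with it.
    float *devSlot = dev_arrays + curStream * arraySize;

    cudaMemcpyAsync(devSlot, host_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyHostToDevice, streams[curStream]);
    YourKernel<<<grid, block, smemSize, streams[curStream]>>>(...);
    cudaMemcpyAsync(host_arrays + i*arraySize, devSlot, sizeof(float)*arraySize, cudaMemcpyDeviceToHost, streams[curStream]);
}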
For host memory, it will just depend on how data is arriving into host memory and where it goes after processing. There is not much advantage to dumping many kernels into a stream at once, so you can always block the above code snippet to account for the number of available host arrays (i.e., numArrays would become BLOCK_SIZE * NUM_STREAMS, or smaller), as in the sketch below. That way you can control how much host memory needs to be page-locked.
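A blocked version might look roughly like this (a sketch under the same assumptions; BLOCK_SIZE, fillHostBlock(), and drainHostBlock() are hypothetical placeholders for however the data actually arrives in and leaves the pinned staging buffers):

#define BLOCK_SIZE 8   // arrays per stream per block; tune against how much memory you can pin

// The pinned staging area now only needs to hold one block of arrays.
const int arraysPerBlock = BLOCK_SIZE * NUM_STREAMS;
cudaMallocHost((void**)&host_arrays, sizeof(float) * arraySize * arraysPerBlock);

for (int base = 0; base < numArrays; base += arraysPerBlock)
{
    const int count = (numArrays - base < arraysPerBlock) ? (numArrays - base) : arraysPerBlock;

    // Hypothetical: bring the next 'count' input arrays into the pinned buffers.
    fillHostBlock(host_arrays, base, count);

    for (int i = 0; i < count; ++i)
    {
        const int curStream = i % NUM_STREAMS;
        float *devSlot = dev_arrays + curStream * arraySize;

        cudaMemcpyAsync(devSlot, host_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyHostToDevice, streams[curStream]);
        YourKernel<<<grid, block, smemSize, streams[curStream]>>>(...);
        cudaMemcpyAsync(host_arrays + i*arraySize, devSlot, sizeof(float)*arraySize, cudaMemcpyDeviceToHost, streams[curStream]);
    }

    // Wait for the whole block to finish before reusing the pinned buffers.
    for (int s = 0; s < NUM_STREAMS; ++s) { cudaStreamSynchronize(streams[s]); }

    // Hypothetical: move the 'count' results out of the pinned buffers.
    drainHostBlock(host_arrays, base, count);
}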
Note that staging from unpinned host memory to pinned host memory (i.e., memcpy()ing a piece of some larger data structure into the pinned buffer) just so you can use streams is not generally a performance win. However, if you are, say, populating the host buffers from a high-speed network connection, then the above should work. (Although you should also check whether GPUDirect is an option in that case.)