Say, I have a kernel that does some simple math on one input array and produces one output array. Both the input and output arrays are 5 megabytes in size, and each array contains single-precision float values.
The number of input arrays is, say, 1000. What I want to do is to implement the “copy input array to device” → “run kernel on input array” → “copy results from device” pipeline that would handle all 1000 input arrays sequentially.
Obviously, it would be great to overlap the memcopies and calculations using CUDA streams. The CUDA samples show how to run a fixed number of operations on a fixed number of streams, but not how to keep feeding the streams with new work as soon as they become idle. What I have in mind is to create two CUDA streams (one for memcopies and one for kernel runs) and handle all the input data with them, but I still can’t understand how to do it using the CUDA samples as a basis.
Has anybody of you guys implemented something similar? I would really appreciate any hints.
If you have a kernel that consumes data transferred with a memcpy, then the memcpy and the kernel should be on the same stream. Separate streams can execute independently, so the kernel could start before the memcpy has finished if they are on separate streams. (A more global synchronization, such as cudaDeviceSynchronize(), could prevent this.)
I believe that something like the following should work:
#define NUM_STREAMS 16

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i) { cudaStreamCreate(&streams[i]); }

for (int i = 0; i < numArrays; ++i)
{
    // Round-robin dependent tasks across the streams.
    const int curStream = i % NUM_STREAMS;

    // The H2D copy, the kernel, and the D2H copy are issued to the same stream,
    // so they run in order relative to each other but can overlap with work
    // issued to the other streams.
    cudaMemcpyAsync(dev_arrays + i*arraySize, host_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyHostToDevice, streams[curStream]);
    YourKernel<<<grid, block, smemSize, streams[curStream]>>>(...);
    cudaMemcpyAsync(host_arrays + i*arraySize, dev_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyDeviceToHost, streams[curStream]);
}

for (int i = 0; i < NUM_STREAMS; ++i)
{
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
Error checking and some declarations are missing, but hopefully the intent is clear. You just round-robin dependent tasks into some number of streams. The host memory must be pinned (page-locked) for the async memcpy calls to actually be asynchronous. You can adjust NUM_STREAMS for the architecture: GK110 added support for many more concurrent streams (see HyperQ). I’m not sure what the value should be for Fermi, but probably just 1 or 2 (as I understand it, Fermi only has a single hardware queue).
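For reference, the missing declarations might look roughly like this (a minimal sketch; host_arrays, dev_arrays, arraySize, and numArrays are the names used above, and the sizes are just the ones from the original question):

const size_t arraySize = (5 * 1024 * 1024) / sizeof(float);  // floats per 5 MB array
const int numArrays = 1000;

float *host_arrays = NULL;
float *dev_arrays  = NULL;

// Pinned (page-locked) host memory, required for cudaMemcpyAsync to overlap with kernels.
cudaMallocHost((void**)&host_arrays, sizeof(float) * arraySize * numArrays);

// One device slot per input array, as assumed by the indexing in the loop above.
cudaMalloc((void**)&dev_arrays, sizeof(float) * arraySize * numArrays);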
Great code snippet, I understand the idea. However, if I understand right, this code assumes that both host and device memory are preallocated and that the allocated blocks are big enough to fit all of the arrays.
The number of arrays can be rather big; the overall amount of memory required to fit them all is about 5 gigabytes. Only some Teslas have that much RAM, to say nothing of the fact that it would hardly be possible to pin such a large piece of RAM on the host.
Well, successive kernels submitted on the same stream are serialized with respect to the previous kernels in that stream, so it is safe to use only NUM_STREAMS device buffers. (The code would then change to dev_arrays + curStream*arraySize above, roughly as sketched below.) That takes care of the device memory constraint: NUM_STREAMS can be adjusted to control the device memory usage.
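In code, that change might look roughly like this (same hypothetical names as before; only NUM_STREAMS device slots are allocated and each stream reuses its own slot):

// Device memory shrinks from numArrays slots to NUM_STREAMS slots.
cudaMalloc((void**)&dev_arrays, sizeof(float) * arraySize * NUM_STREAMS);

for (int i = 0; i < numArrays; ++i)
{
    const int curStream = i % NUM_STREAMS;

    // Work issued to the same stream executes in order, so iteration
    // i + NUM_STREAMS cannot touch this slot before iteration i is done with it.
    float *devSlot = dev_arrays + curStream * arraySize;

    cudaMemcpyAsync(devSlot, host_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyHostToDevice, streams[curStream]);
    YourKernel<<<grid, block, smemSize, streams[curStream]>>>(...);
    cudaMemcpyAsync(host_arrays + i*arraySize, devSlot, sizeof(float)*arraySize, cudaMemcpyDeviceToHost, streams[curStream]);
}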
For host memory, it will just depend on how data is arriving into host memory and where it goes after processing. There is not much advantage to dumping many kernels into a stream at once, so you can always block the above code snippet to account for the number of available host arrays (i.e., numArrays would become BLOCK_SIZE * NUM_STREAMS, or smaller), as in the sketch below. That way you can control how much host memory needs to be page-locked.
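A blocked version might look roughly like this (a sketch under the same assumptions; BLOCK_SIZE, fillHostBlock(), and drainHostBlock() are hypothetical placeholders for however the data actually arrives in and leaves the pinned staging buffers):

#define BLOCK_SIZE 8   // arrays per stream per block; tune against how much memory you can pin

// The pinned staging area now only needs to hold one block of arrays.
const int arraysPerBlock = BLOCK_SIZE * NUM_STREAMS;
cudaMallocHost((void**)&host_arrays, sizeof(float) * arraySize * arraysPerBlock);

for (int base = 0; base < numArrays; base += arraysPerBlock)
{
    const int count = (numArrays - base < arraysPerBlock) ? (numArrays - base) : arraysPerBlock;

    // Hypothetical: bring the next 'count' input arrays into the pinned buffers.
    fillHostBlock(host_arrays, base, count);

    for (int i = 0; i < count; ++i)
    {
        const int curStream = i % NUM_STREAMS;
        float *devSlot = dev_arrays + curStream * arraySize;

        cudaMemcpyAsync(devSlot, host_arrays + i*arraySize, sizeof(float)*arraySize, cudaMemcpyHostToDevice, streams[curStream]);
        YourKernel<<<grid, block, smemSize, streams[curStream]>>>(...);
        cudaMemcpyAsync(host_arrays + i*arraySize, devSlot, sizeof(float)*arraySize, cudaMemcpyDeviceToHost, streams[curStream]);
    }

    // Wait for the whole block to finish before reusing the pinned buffers.
    for (int s = 0; s < NUM_STREAMS; ++s) { cudaStreamSynchronize(streams[s]); }

    // Hypothetical: move the 'count' results out of the pinned buffers.
    drainHostBlock(host_arrays, base, count);
}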
Note that staging from unpinned host memory to pinned host memory (i.e., memcpy()ing a piece of some larger data structure into the pinned buffer) just so you can use streams is not generally a performance win. However, if you are, say, populating the host buffers from a high-speed network connection, then the above should work. (Although you should also check whether GPUDirect is an option in that case.)