Optimal Use of Streams?

Hi,

I haven’t actually used streams before, but I’m looking to try to speed up an existing CUDA solution using them. I’ve read the relevant section in the programming guide, but I’m a little unsure as to what extent they can be used. Is there a notable overhead in using streams?

Let’s say I have an array of 100,000 elements stored in host memory and I wish to copy it to device memory, perform a calculation, then copy the result back. How many streams should I use? Is there a suggested tradeoff between number of streams and ability to overlap transfer and compute?

Thanks,
Dan

The example you describe has no operations that can run simultaneously (since each step depends on the previous), so you only need one stream. Prior to Fermi, multiple streams were most useful for cases where you could operate on one chunk of data while sending the next. This usually required some kind of double buffering. In your example, if you had many 100,000 element arrays to process, you could use one stream to send array 2 while the other stream processed array 1, and so on.
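
In code, that double-buffering pattern looks roughly like this (just a sketch, assuming the host array is in pinned memory so the async copies can overlap, and that the chunk size divides evenly by the block size; "process" is a stand-in for whatever kernel you run on each chunk):

__global__ void process(float* data, int n)   // placeholder kernel
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

void processChunks(float* hostData, int numChunks, int chunkSize)
{
    cudaStream_t streams[2];
    float* devBuf[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&devBuf[i], chunkSize * sizeof(float));
    }

    for (int c = 0; c < numChunks; ++c) {
        int s = c % 2;   // alternate between the two buffers/streams
        cudaMemcpyAsync(devBuf[s], hostData + c * chunkSize,
                        chunkSize * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<chunkSize / 256, 256, 0, streams[s]>>>(devBuf[s], chunkSize);
        // while stream s copies and computes chunk c, the other stream is
        // free to start on chunk c+1; a copy-back would also go on streams[s]
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(devBuf[i]);
    }
}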

Now with Fermi, there are a few additional uses for streams. If you have several kernels to run independently, but they might not all make full use of the GPU, then by launching them in different streams you give the scheduler the option of using blocks from different kernels to keep the GPU busy. And with the Tesla C2050/70 cards, the DMA engine can perform a host-to-device transfer at the same time as a device-to-host transfer. So again, in the case where you had many arrays to process, you could actually do triple buffering, where you process one array while filling a second array and retrieving results from a third.
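
For the concurrent-kernel case, the pattern is simply independent launches into different streams (hypothetical kernels, just to show the shape of it):

__global__ void scaleKernel(float* x, int n)    // hypothetical small kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void offsetKernel(float* y, int n)   // another independent kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

void launchIndependently(float* devX, float* devY, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel <<<blocks, threads, 0, s1>>>(devX, n);   // independent work in
    offsetKernel<<<blocks, threads, 0, s2>>>(devY, n);   // two different streams

    cudaDeviceSynchronize();   // wait for both streams
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}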

The key to using streams in any case is having a program with independent kernel launches and/or memory transfers. Without that, you don’t need streams.

In that particular example, each block (and indeed each thread) could be executed independently. For example, using a simple kernel like this:

__global__ void add(int* a, int* b, int* result)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    result[idx] = a[idx] + b[idx];
}

Surely this would benefit from batching the memory transfer and computation of the array within streams? As soon as a small segment of memory had been transferred, the computation could start, instead of waiting for the entire array to transfer. Clearly this will have limited impact for such a trivial example, but on a much larger scale I’d like to know how to decide how many streams to split this batching into?
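
Something along these lines is what I have in mind (just a sketch; it assumes the host arrays are pinned, that the array splits evenly into chunks, and that the chunk size divides evenly by the block size):

#include <vector>

void addInChunks(const int* hA, const int* hB, int* hResult, int n, int numChunks)
{
    int chunkSize = n / numChunks;          // assume n divides evenly
    std::vector<cudaStream_t> streams(numChunks);
    int *dA, *dB, *dR;
    cudaMalloc(&dA, n * sizeof(int));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMalloc(&dR, n * sizeof(int));

    for (int c = 0; c < numChunks; ++c) {
        cudaStreamCreate(&streams[c]);
        int off = c * chunkSize;
        // the copies for chunk c can overlap the add kernel for chunk c-1
        cudaMemcpyAsync(dA + off, hA + off, chunkSize * sizeof(int),
                        cudaMemcpyHostToDevice, streams[c]);
        cudaMemcpyAsync(dB + off, hB + off, chunkSize * sizeof(int),
                        cudaMemcpyHostToDevice, streams[c]);
        add<<<chunkSize / 256, 256, 0, streams[c]>>>(dA + off, dB + off, dR + off);
        cudaMemcpyAsync(hResult + off, dR + off, chunkSize * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[c]);
    }

    for (int c = 0; c < numChunks; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dR);
}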

Ah, ok. You didn’t say what calculation you were doing with the array, so I didn’t assume it could be factorized into several kernels. :)

Well, if you are looking for a win by overlapping computation and transfer, then in principle (assuming you can balance things well) you only need two for pre-Fermi devices. They can only run a kernel overlapping with one memory transfer. Tesla-series Fermi devices could benefit from 3 streams, so you can do each transfer direction + computation at the same time.

For overlapping small kernel calculations, you can potentially go up to 16 now, since that’s the max number of kernels Fermi can concurrently run. This case doesn’t apply to your example, though.
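
If you want to check what your particular card supports before picking a number, the device properties will tell you (field names as in the runtime API; asyncEngineCount only appears in newer toolkits):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("copy engines (asyncEngineCount): %d\n", prop.asyncEngineCount);
    printf("concurrent kernels supported:    %d\n", prop.concurrentKernels);
    printf("can overlap copy and kernel:     %d\n", prop.deviceOverlap);
    return 0;
}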

Ok, that makes sense. However, streams are essentially just a sequential ordering of parallel tasks, right? So, at the point at which the initial memory transfer is completed, it will be correctly overlapped (assuming we have enough computation to match the transfer time). That suggests that we will still have an unavoidable overhead for the first memory transfer before we can do any calculation? What I’m suggesting is that by reducing the size of the chunks of memory to be transferred (i.e. by splitting the work across more streams) we can reduce this initial overhead?

I may have misunderstood how this works, but it seems logical that the memory transfer can’t be entirely hidden in the computation time?
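
To put rough numbers on what I mean, here is a simple two-stage pipeline model (the 10 ms copy and 8 ms compute times are made up, purely for illustration):

#include <stdio.h>

/* Fill time for the first chunk through both stages, plus (N - 1) further
 * steps at the bottleneck stage. */
double pipelinedTime(double copyTotal, double computeTotal, int numChunks)
{
    double copyPerChunk    = copyTotal / numChunks;
    double computePerChunk = computeTotal / numChunks;
    double slowest = copyPerChunk > computePerChunk ? copyPerChunk : computePerChunk;
    return copyPerChunk + computePerChunk + (numChunks - 1) * slowest;
}

int main(void)
{
    for (int n = 1; n <= 16; n *= 2)
        printf("%2d chunks: ~%.1f ms\n", n, pipelinedTime(10.0, 8.0, n));
    return 0;
}

So the un-overlapped priming cost shrinks roughly as 1/N, but the total can never drop below the larger of the copy and compute times.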

Correct, streams are only going to be useful if you have many operations to perform in a row, in which case that initial priming step is negligible.

Surely it will still be useful in overlapping some of the transfer and compute even if there’s only a small number of operations to be performed in a row?

Thank you very much for your input so far, you’re confirming what I’ve deduced so far, but you’re still not really answering my question! :)

I’d like to know about the tradeoff between reducing the overhead of that initial memory transfer by using more streams and incurring the maintenance penalty of having so many streams to process at once. Is there generally a method or formula for choosing the number of streams?

In my original example, I could use 100,000 streams, one per block, so the initial priming step will disappear pretty fast, but I very much doubt this will even be possible! So how do I choose a sensible number?

To be honest, I have no experience taking a single kernel and speeding it up with stream usage. My typical usage pattern involves loading a large amount of data up front and then operating on it many, many times. I’ve never had a kernel limited by PCI-Express bandwidth, which is the situation where the kind of solution we are discussing is sensible.

You are unlikely to find an accurate formula for this (or any other CUDA performance metrics), so the best approach is to make it easy to change this parameter in your code and benchmark a range of possibilities.

One heuristic to note is that as you subdivide your problem into smaller and smaller kernels, at some point you underutilize the GPU and lose performance. A good rough target is 30-50% occupancy, though benchmarking is important here as well. For example, 30% occupancy on a GTX 285 means that you want 10,000 active threads. In that case, cutting your hypothetical 100,000 element array into more than 10 pieces is likely to have a negative effect.
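
As a concrete way to do that, something like the following (runWithStreams is a stand-in for whatever chunked version you end up writing) lets you sweep the stream count and time each run with CUDA events:

#include <cstdio>
#include <cuda_runtime.h>

void runWithStreams(int numStreams) { (void)numStreams; }  // replace with your chunked implementation

int main()
{
    for (int n = 1; n <= 32; n *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        runWithStreams(n);
        cudaDeviceSynchronize();      // make sure all streams have drained
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%2d streams: %.3f ms\n", n, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    return 0;
}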

Ok, thanks, yeah that’s useful. In actual fact, the problem I’m trying to get around is that I’m working with extremely large data sets, such that I can’t fit them entirely into GPU memory (even on a C1060). So I’m really looking at a way of batching up the data so that I can transfer and compute portions of it and theoretically work with infinite data sets. It seemed like streams were the most likely avenue to go down, though I’m also looking at the feasibility of using mapped memory (in places the memory is not accessed with coalescing, so this may not work so well).
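
For the mapped-memory variant, I was thinking of something like this (a sketch; the kernel reads the pinned, mapped host buffers directly over PCIe, so the full arrays never have to fit in device memory; names are just placeholders):

#include <cuda_runtime.h>

__global__ void addMapped(const int* a, const int* b, int* result, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        result[idx] = a[idx] + b[idx];
}

int main()
{
    const int n = 100000;
    cudaSetDeviceFlags(cudaDeviceMapHost);    // must be set before the context is created

    int *hA, *hB, *hR;
    cudaHostAlloc(&hA, n * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&hB, n * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc(&hR, n * sizeof(int), cudaHostAllocMapped);
    // ... fill hA and hB here ...

    int *dA, *dB, *dR;                        // device-side views of the same memory
    cudaHostGetDevicePointer(&dA, hA, 0);
    cudaHostGetDevicePointer(&dB, hB, 0);
    cudaHostGetDevicePointer(&dR, hR, 0);

    addMapped<<<(n + 255) / 256, 256>>>(dA, dB, dR, n);
    cudaDeviceSynchronize();                  // results are now visible in hR

    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hR);
    return 0;
}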

It’s not really so much about the speed, but overlapping transfer and compute will still help with this. There are quite a few problems with doing this on these data sets, so I’m unsure as to how well this is going to work, but it sounds like it’s going to be worth investigating further, so I’ll just start working on an implementation and benchmark it to tune it appropriately.

Thanks for your help.
