The Overhead of Streams and Events


Is there a place where one can read up on the limits to the number of streams and events in CUDA, as well as their impact on performance when creating large numbers of them?

Background: to my knowledge, the best way of running work concurrently on a GPU is to place it on different streams and to synchronize those streams by registering an event on one stream and synchronizing to that event on another. Sometimes, one uses a number of smaller pieces of memory instead of one large chunk.
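For context, the synchronization pattern I mean looks roughly like this (a sketch with placeholder kernel and buffer names, error checking omitted):

```cuda
// Make streamB wait for work previously issued on streamA.
cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

cudaEvent_t done;
// Timing disabled: such events are documented as the cheaper kind.
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

producerKernel<<<grid, block, 0, streamA>>>(dBuf);  // placeholder kernel
cudaEventRecord(done, streamA);          // "register" the event on streamA
cudaStreamWaitEvent(streamB, done, 0);   // streamB stalls until 'done' fires
consumerKernel<<<grid, block, 0, streamB>>>(dBuf);  // placeholder kernel
```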
Say I have 1000 images that need to be processed and a kernel can dispatch a batch of 10 of these images at a time. It would be a good idea to have 30 of them on the GPU at a time, where 10 are being uploaded, 10 are being processed, and 10 are being downloaded. Now, uploading (or downloading) these images needs to happen in no particular order. So the question is whether one should upload them on separate streams in hopes of saving some synchronization overhead. At least the profiler suggests that concurrency is a desirable thing here. Alternatively, the problem could be that the images need some separate pre-processing in 10 independent kernels. What if it’s not 10 images that are to be processed concurrently, but 20 or 50 or 100?
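To make the setup concrete, the upload/process/download pipeline could be sketched like this (hypothetical names throughout; the host buffers would need to be allocated with cudaHostAlloc for the copies to actually run asynchronously):

```cuda
// Triple-buffered pipeline: batch i uploads on 'up', runs on 'exec',
// downloads on 'down'; events chain the stages and recycle the 3 slots.
cudaStream_t up, exec, down;
cudaStreamCreate(&up); cudaStreamCreate(&exec); cudaStreamCreate(&down);

for (int i = 0; i < numBatches; ++i) {
    int slot = i % 3;  // 3 batches in flight: upload, process, download
    if (i >= 3)        // wait until this slot's previous download finished
        cudaStreamWaitEvent(up, drained[slot], 0);

    cudaMemcpyAsync(dBuf[slot], hIn[i], batchBytes,
                    cudaMemcpyHostToDevice, up);
    cudaEventRecord(uploaded[slot], up);

    cudaStreamWaitEvent(exec, uploaded[slot], 0);
    processBatch<<<grid, block, 0, exec>>>(dBuf[slot]);  // placeholder kernel
    cudaEventRecord(processed[slot], exec);

    cudaStreamWaitEvent(down, processed[slot], 0);
    cudaMemcpyAsync(hOut[i], dBuf[slot], batchBytes,
                    cudaMemcpyDeviceToHost, down);
    cudaEventRecord(drained[slot], down);
}
```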

To be clear, I’m not looking to solve any one particular problem. Rather, understanding the overhead of streams and events would allow me to make more informed decisions. The question splits into multiple parts:

  • Do I understand correctly that having one stream wait for another is best done by registering an event and synchronizing to that event?
  • Is there some cost associated with registering an event that would become notable if done a sufficient number of times?
  • Is there some cost associated with synchronizing to an event?
  • Is there some cost associated with creating an event? How about creating a lot of events that are “alive” at the same time?
  • Is there some cost associated with registering an event?
  • Is there some cost associated with having a lot of streams active at the same time?
  • Is there some gain to be expected by running a couple of uploads/downloads concurrently rather than in sequence?
  • Is there some gain to be expected by running a couple of kernels concurrently rather than in sequence?

And whenever there is a cost/gain, how large is it, and at what number does it “kick in” or become irrelevant? I know these are a lot of questions and there are probably more to be asked. So pointing towards a place to read up on those things will be fine, if there is one.


> Is there some gain to be expected by running a couple of uploads/downloads concurrently rather than in sequence?

Not in my experience. For many GPUs, which have at most a single copy engine per direction, only one transfer can be outstanding per direction at any given moment.
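As a side note, whether a GPU has one or two copy engines can be queried at runtime; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // asyncEngineCount == 1: copies overlap with kernels, but only one
    // direction at a time; == 2: H2D and D2H can run concurrently.
    std::printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
    return 0;
}
```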

> Is there some gain to be expected by running a couple of kernels concurrently rather than in sequence?

For some kernels, yes; for other kernels, no. Generally, if a kernel is large enough to “fill up” a GPU, then there is usually little or no benefit to attempting to run concurrent kernels. The exception to this would be the “tail effect”, which you can google, but even this doesn’t usually account for more than a few percentage points of performance, in my experience.