Question on Stream, Connection and Performance


I have a question regarding the relationship between the number of CUDA streams and CUDA_DEVICE_MAX_CONNECTIONS. Although it isn't explicitly documented in the CUDA resources, the topic comes up regularly in community discussions. After researching the relevant posts, I've gathered some insights and would like to verify my understanding.

To start, it's generally accepted that each CUDA context does not have a hard limit on the number of streams. However, there's a practical constraint imposed by CUDA_DEVICE_MAX_CONNECTIONS, which can be set as high as 32. This limits the true underlying independent parallelism between streams: if the number of streams exceeds CUDA_DEVICE_MAX_CONNECTIONS, kernels from different streams may still be stalled and ordered behind one another, potentially introducing false dependencies and unnecessary serialization.

First, I seek clarification on a specific aspect of this statement:

  • the distinction between “connections” and “HW queues.” Are these terms synonymous? I perceive “connections” as a software concept, referring to the connection between the CUDA driver on the host and the GPU device, while “HW queues” may represent physical hardware components with finite capacities.
  • Are both resources limited per CUDA context?

My primary question is about maximizing performance, setting aside stream-creation and connection-creation overhead. By performance I mean the ability to launch a large number of independent kernels, all with identical launch parameters. I see two feasible approaches:

  • Approach 1: Set CUDA_DEVICE_MAX_CONNECTIONS to 32 and create 32 streams, with the hope that each stream gets its own connection and queue slot. Then launch kernels on these streams. Kernels from different streams are unordered, but within a stream they adhere to CUDA stream semantics (see the sketch after this list).
  • Approach 2: Similarly, set CUDA_DEVICE_MAX_CONNECTIONS to 32 but create hundreds or thousands of streams, allowing CUDA to schedule them. Within each stream, launch kernels independently (though they may not execute concurrently in reality).
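To make this concrete, here is a minimal sketch of what I mean by Approach 1 (the kernel, sizes, and file name are placeholders of my own; note that the environment variable must be set before the CUDA runtime is initialized for it to take effect). Approach 2 would only differ in the number of streams created:

```cpp
// approach1_sketch.cu -- illustrative only; build with: nvcc -O2 approach1_sketch.cu
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void work(float *x, int n) {               // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    // Must be set before the first CUDA call initializes the context.
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);

    const int numStreams = 32;                         // Approach 2: make this much larger
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[numStreams];
    for (int i = 0; i < numStreams; ++i)
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

    // One independent kernel per stream, all with identical launch parameters.
    for (int i = 0; i < numStreams; ++i)
        work<<<(n + 255) / 256, 256, 0, streams[i]>>>(d, n);

    cudaDeviceSynchronize();
    for (int i = 0; i < numStreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d);
    return 0;
}
```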

Which approach is more favorable for performance? Additionally, assuming a single CUDA context with numerous kernels, and considering that recent GPUs have over 100 streaming multiprocessors (SMs), if CUDA_DEVICE_MAX_CONNECTIONS is set to 32 and 32 streams are created, should I carefully manage the number of blocks per kernel so that all 32 streams can utilize the SMs concurrently? I am also wondering why CUDA_DEVICE_MAX_CONNECTIONS has a default value of 8 rather than the maximum.


Apologies for the extensive question, but I hope I've communicated it clearly. It stems from exploring CUDA streams, connections, and MPS, and I'm struggling to integrate these concepts into a cohesive understanding of streams, connections, HW queues, and SMs.


Relevant Posts:

https://forums.developer.nvidia.com/t/do-asynchronous-activities-issued-to-different-streams-share-the-same-queue/

https://forums.developer.nvidia.com/t/concurrent-kernel-and-events-on-kepler


I doubt there will be much difference. You can try both ways to be sure.

When I am teaching CUDA, I often point out that there is rarely need for more than 3-4 streams per device. The idea that it should ever be necessary to have hundreds or thousands of streams is completely foreign to me. The only reason I can imagine that anyone would do that is simplicity of coding, i.e. lack of attention, or lack of applying additional effort to the problem. Stream creation is not free. Even if you don’t run out of streams, there is undoubtedly a resource cost to each, and the resource cost, whatever it may be, is not well specified by NVIDIA.

I offer a test case of my opinion to students who are taking the DLI course “Accelerating CUDA C++ with multiple GPUs”. This course has a primary focus to teach compute/copy overlap, and of course streams are a central concept in that.

At one point we reach a fully overlapped test case that the students work on. A naive implementation leads to the creation of ~40-50 streams (we only have one device in view at this point). And there is nothing wrong with that per se. The chunking of work naturally lends itself to about 40 chunks, one stream per chunk. It works fine. However, I demonstrate to the students that the same or better performance can be achieved using 4 streams, with stream re-use between chunks of work.
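In code, that stream re-use pattern is roughly the following (an illustrative sketch, not the course solution; the kernel and sizes are placeholders):

```cpp
// overlap_sketch.cu -- illustrative only. Build: nvcc -O2 overlap_sketch.cu
#include <cuda_runtime.h>

__global__ void process(float *x, size_t n) {          // placeholder per-chunk kernel
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int    numStreams = 4;                        // small, re-used pool
    const int    numChunks  = 40;                       // work naturally splits into ~40 chunks
    const size_t chunkElems = 1 << 20;
    const size_t totalElems = (size_t)numChunks * chunkElems;

    float *h, *d;
    cudaHostAlloc(&h, totalElems * sizeof(float), cudaHostAllocDefault); // pinned host memory, needed for real overlap
    cudaMalloc(&d, totalElems * sizeof(float));

    cudaStream_t streams[numStreams];
    for (int i = 0; i < numStreams; ++i) cudaStreamCreate(&streams[i]);

    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t s   = streams[c % numStreams];     // re-use streams round-robin
        size_t       off = (size_t)c * chunkElems;
        cudaMemcpyAsync(d + off, h + off, chunkElems * sizeof(float), cudaMemcpyHostToDevice, s);
        process<<<(chunkElems + 255) / 256, 256, 0, s>>>(d + off, chunkElems);
        cudaMemcpyAsync(h + off, d + off, chunkElems * sizeof(float), cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < numStreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```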

An indisputable takeaway is that in many or most copy/compute overlap cases there is no particular need for each stream to have its own HW queue, queue slot, or whatever term you prefer. 4 streams are enough to get full/max performance, with a properly designed code.

So I am convinced, for a focused/coordinated/organized piece of work, all the benefit that can be had, can be had with ~4 streams per device. In that course, we don’t bother with CUDA_DEVICE_MAX_CONNECTIONS or imagine that it is a hidden CUDA secret that magically unlocks performance that only a few ninjas know about.

Instead we demonstrate that for a real-world application, using expert methods (overlap of copy and compute), no more than 4 streams are needed, not even approaching the default connections limit of 8 per device.

Can you find a counterexample? Probably. But my guess is that the counterexamples are probably 1% of extant uses (examples) of properly done copy/compute overlap in production codes. That’s just a guess. There are literally millions of CUDA codes out there. You have found 6 forum posts where people are asking possibly related questions. Expecting that this is a really important concept to delve into, with no description whatsoever of the structure of your case or why it might be important, does not align with my way of thinking.

For what it's worth, I tried creating a large number of streams on an L4 GPU, and I got a CUDA "out of memory" error on trying to create 142649 streams. So there is a resource cost.
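That observation is easy to reproduce with a trivial sketch like this (the exact count at which it fails will vary by GPU, driver, and available memory):

```cpp
// How many streams can we create before the runtime reports an error?
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    std::vector<cudaStream_t> streams;
    for (unsigned long i = 0; ; ++i) {
        cudaStream_t s;
        cudaError_t err = cudaStreamCreate(&s);
        if (err != cudaSuccess) {
            printf("stream creation failed at stream %lu: %s\n", i, cudaGetErrorString(err));
            break;
        }
        streams.push_back(s);
    }
    for (auto s : streams) cudaStreamDestroy(s);
    return 0;
}
```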

There is no idea in CUDA that “more streams is always better” or “more streams gives higher performance” or “using the max number of streams is best”. Using the max number of streams or device connections is not a figure of merit, generally.

Whenever you issue work to the device (i.e. a kernel call), there is well-founded conventional wisdom that says you should seek to fully occupy the device, ideally with a single kernel call. That is the right mindset for determining how many blocks to launch, and the mechanics are covered in many forum posts and elsewhere. That means that ideally, in a copy/compute overlap scenario, each chunk of work (typically corresponding to one kernel call) should fully occupy the device. The only time you would deviate from this is when the overall problem size is so small that you cannot achieve this while still aiming for copy/compute overlap (let's say a minimum of 3 or 4 chunks). In that case I don't think there is conventional wisdom, and you should probably play with chunk size vs. number of streams to achieve the best performance, recognizing this is a non-optimal situation to begin with.
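For the "fully occupy the device with a single kernel call" part, a rough sketch using the occupancy API looks like this (the `work` kernel is a placeholder):

```cpp
#include <cuda_runtime.h>

__global__ void work(float *x) { /* placeholder: grid-stride loop over the data */ }

void launch_full_device(float *d_x) {
    int device = 0, numSMs = 0, blocksPerSM = 0;
    const int blockSize = 256;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // How many blocks of `work` can be resident per SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work, blockSize, 0);
    int gridSize = numSMs * blocksPerSM;    // enough resident blocks to fill every SM
    work<<<gridSize, blockSize>>>(d_x);
}
```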

You might want to develop a test case of your own to gain additional understanding. Many of these concepts are covered in section 7 of this online training series. Once you have a test case, you can try varying CUDA_DEVICE_MAX_CONNECTIONS (or number of streams) if you wish, and if you have questions at that point, they are likely to be more focused. The homework for that section has an example that you could study if you wish.
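For such a test case, a simple busy-wait kernel launched once per stream makes concurrency (or serialization) easy to see in a profiler such as Nsight Systems. A rough sketch (the spin duration is arbitrary):

```cpp
// A busy-wait kernel of roughly fixed duration; launch one per stream while
// varying CUDA_DEVICE_MAX_CONNECTIONS and the stream count, then inspect the
// timeline (e.g. with nsys) to see which launches actually run concurrently.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }   // spin for ~`cycles` device clocks
}

// e.g., per stream i:  spin<<<1, 1, 0, streams[i]>>>(100000000LL);
```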

Good luck!


Hi Robert,

Thank you for your prompt response. I’ve carefully reviewed your previous insights on the topic of streams and connections, and I’m convinced that employing a few well-designed streams should suffice for achieving overlap between data transfer, computation, and events.

Currently, I'm reading the cuFile Stream API documentation. Essentially, this API binds asynchronous read/write operations to a stream. Thus, in order to achieve independent and unordered asynchronous I/O requests, it's becoming apparent that relying on just one stream might not be adequate. Additionally, considering that the number of available connections can go up to 32, I'm thinking about how to map asynchronous I/O onto those 32 connections. I'm curious whether you have any insights or experience with the GDS cuFile Stream API and streams that you could share.
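What I have in mind is roughly the following sketch (heavily abbreviated and untested; I am assuming the file handle and device buffer have already been registered via the cuFile APIs, error handling is omitted, and the exact cuFileReadAsync signature should be checked against cufile.h):

```cpp
// Rough sketch: issue independent reads round-robin over a small pool of
// streams. Assumes cfHandle came from cuFileHandleRegister and devBuf from
// cudaMalloc (+ optional cuFileBufRegister); the per-request parameter arrays
// must stay valid until the corresponding stream work completes.
const int numStreams = 8;                       // <= the connection count
cudaStream_t streams[numStreams];
for (int i = 0; i < numStreams; ++i)
    cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

for (int r = 0; r < numRequests; ++r) {
    cudaStream_t s = streams[r % numStreams];
    // Enqueue an asynchronous read of request r into its slice of devBuf.
    cuFileReadAsync(cfHandle, devBuf,
                    &readSize[r], &fileOffset[r], &devOffset[r],
                    &bytesRead[r], s);
}
for (int i = 0; i < numStreams; ++i)
    cudaStreamSynchronize(streams[i]);
```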


Regarding the materials you recommended, I’ll certainly dive into them. Thanks again for your assistance!

I have no answers, but some related points:

  1. In addition to the limits you mention, there is a limit on the number of resident grids per device.
    See Table 18.

  2. If you use dynamic parallelism, you can easily reach the grid limit using only one stream created by the host.

An approach I have been exploring recently:

  • create only a small number of streams on the host, e.g., one or two streams for each desired stream priority
  • use a combination of host launches and dynamic parallelism to achieve the concurrency you want (see the sketch below)
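A minimal sketch of that pattern, assuming CUDA 12+ device-side named streams (cudaStreamFireAndForget) and compilation with -rdc=true; the kernels and task count are placeholders:

```cpp
// dp_sketch.cu -- illustrative; build with: nvcc -rdc=true dp_sketch.cu
#include <cuda_runtime.h>

__global__ void child(int task) { /* placeholder: do one task's work */ }

__global__ void parent(int numTasks) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < numTasks)
        child<<<1, 128, 0, cudaStreamFireAndForget>>>(t);  // one independent child grid per task
}

int main() {
    const int numTasks = 1024;                // still subject to the resident-grid limit
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    // Only two host streams: one per priority level.
    cudaStream_t hiPrio, loPrio;
    cudaStreamCreateWithPriority(&hiPrio, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&loPrio, cudaStreamNonBlocking, leastPrio);

    parent<<<(numTasks + 127) / 128, 128, 0, hiPrio>>>(numTasks);
    cudaDeviceSynchronize();

    cudaStreamDestroy(hiPrio);
    cudaStreamDestroy(loPrio);
    return 0;
}
```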

In addition to the links you posted, beware of Increased time to synchronize….

I never suggested one stream is adequate for anything. One stream is useless for any purpose I can think of, and you certainly cannot do copy/compute overlap with a single stream.

By convention, GDS questions should be posted on the accelerated libraries sub-forum (there are many other GDS questions there).


Apologies for any confusion - I didn't mean to imply that you had suggested that; it was a mistake in my wording.

What I meant is that, given stream and cuFile semantics, it seems natural to consider multiple streams.
