I have a question regarding the relationship between the number of CUDA streams and `CUDA_DEVICE_MAX_CONNECTIONS`. Although this topic is not explicitly documented in CUDA resources, it comes up regularly in community discussions. After thoroughly researching the relevant posts, I've gathered some insights and would like to verify my understanding.
To start, it's generally accepted that a CUDA context has no hard limit on the number of streams. However, there is a practical constraint imposed by `CUDA_DEVICE_MAX_CONNECTIONS`, which can be set to at most 32. This limits the true underlying independent parallelism between streams: if the number of streams exceeds `CUDA_DEVICE_MAX_CONNECTIONS`, kernels from different streams may still be stalled and ordered behind one another, potentially creating false dependencies.
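To make the setup I am describing concrete, here is a minimal sketch. The kernel name and stream count are placeholders I made up, and my assumption (please correct me if wrong) is that the environment variable must be set before the CUDA context is lazily created, i.e. before the first runtime API call:

```cpp
#include <cstdlib>        // setenv (POSIX; I am assuming Linux here)
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

int main() {
    // Assumption: must be set before the runtime creates the context,
    // otherwise the value is ignored.
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);

    const int kNumStreams = 64;  // deliberately more streams than connections
    cudaStream_t streams[kNumStreams];
    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // With 64 streams but at most 32 connections, some streams must share a
    // connection, which (per my understanding) can serialize their kernels.
    for (int i = 0; i < kNumStreams; ++i)
        dummyKernel<<<1, 32, 0, streams[i]>>>();

    cudaDeviceSynchronize();
    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```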
First, I seek clarification on two aspects of this statement:
- The distinction between “connections” and “HW queues”: are these terms synonymous? I perceive a “connection” as a software concept, referring to the link between the CUDA driver on the host and the GPU device, whereas a “HW queue” may be a physical hardware component with finite capacity.
- Are both resources limited per CUDA context?
My primary question is about maximizing performance, leaving aside stream-creation and connection-creation overheads. I define performance as the ability to launch a large number of independent kernels, all with identical launch parameters. I see two feasible approaches (a code sketch follows the list):
- Approach 1: Set `CUDA_DEVICE_MAX_CONNECTIONS` to 32 and create exactly 32 streams, with the hope that each stream gets its own connection and queue slot. Then launch kernels on these streams. Kernels from different streams are unordered, while kernels within a stream follow the usual CUDA stream semantics.
- Approach 2: Likewise set `CUDA_DEVICE_MAX_CONNECTIONS` to 32, but create hundreds or thousands of streams and let CUDA schedule them. Within each stream, launch kernels independently (though they may not actually execute concurrently).
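Here is roughly how I would code the two approaches; as I see it, only the stream count differs (all names and numbers below are illustrative, not from any documentation):

```cpp
#include <vector>
#include <cuda_runtime.h>

__global__ void work() { /* every launch uses identical parameters */ }

void launchAll(int numStreams, int numKernels) {
    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    // Round-robin the independent kernels across the streams.
    for (int k = 0; k < numKernels; ++k)
        work<<<4, 256, 0, streams[k % numStreams]>>>();

    cudaDeviceSynchronize();
    for (auto& s : streams) cudaStreamDestroy(s);
}

int main() {
    launchAll(32, 10000);       // Approach 1: streams == connections
    // launchAll(1024, 10000);  // Approach 2: far more streams than connections
    return 0;
}
```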
Which approach appears more favorable for performance? Additionally, assuming a single CUDA context with numerous kernels, and considering that recent GPUs feature over 100 streaming multiprocessors (SMs): if `CUDA_DEVICE_MAX_CONNECTIONS` is set to 32 and 32 streams are created, should I carefully manage the number of blocks per kernel to ensure that all 32 streams effectively utilize the SMs concurrently? I am also wondering why `CUDA_DEVICE_MAX_CONNECTIONS` defaults to 8 rather than the maximum. A sketch of the block-count bookkeeping I have in mind follows.
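This sketch assumes, as a simplification on my part, one resident block per SM; the idea is to size each kernel so that 32 concurrently running kernels together roughly cover all SMs instead of one kernel filling the GPU:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int numStreams = 32;
    // e.g. 108 SMs / 32 streams -> 4 blocks per kernel (rounded up), so all
    // 32 kernels could be resident at once (ignoring per-SM occupancy limits).
    int blocksPerKernel =
        (prop.multiProcessorCount + numStreams - 1) / numStreams;
    printf("SMs: %d, blocks per kernel: %d\n",
           prop.multiProcessorCount, blocksPerKernel);
    return 0;
}
```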
Apologies for the extensive question, but I hope I've communicated my queries effectively. They stem from my exploration of CUDA streams, connections, and MPS, and I am struggling to integrate all of these concepts into a cohesive understanding of streams, connections, HW queues, and SMs.
Relevant Posts:
https://forums.developer.nvidia.com/t/concurrent-kernel-and-events-on-kepler