Is there a user context and a stream context? How are tasks arranged in different streams and delivered to the hardware?

I have some questions about CUDA streams and how tasks execute. Could you help me understand the internal workings, or point me to documents that explicitly cover this?

Q1: Is there a user context and a stream context in the CUDA runtime library, the driver, or even inside the hardware?

A GPU card is a resource shared between different user processes, which can be thought of as users of the hardware. Furthermore, each user can create multiple streams. I think the context for each user is separate so that different users cannot interfere with each other, and likewise for each stream. So I'm wondering: is there a user context and a stream context in the CUDA runtime library, the driver, or even inside the hardware?

Q2: How does a stream ensure that tasks execute exactly in order?

How does CUDA ensure that the hardware executes tasks in the same stream exactly in order? How are tasks arranged in the stream? Is there a separate queue for each stream to store its tasks? If so, is the queue maintained by the user-space CUDA runtime, by the driver, or even by the hardware?

In particular, a cudaMemcpy task is sent to the PCIe DMA engine while a kernel task runs on the GPU SMs, so how can they be put in the same queue?

Q3: How can different streams deliver tasks to the hardware simultaneously?

Different streams can execute in parallel, so is there an independent command channel between the driver and the hardware for each stream? If so, there would have to be a large number of command channels, and that number would be unbounded because the number of streams is, which seems impractical. But if not, a single command channel for all streams also seems unreasonable.

A context has similarities to a process in CPU land. When using the CUDA runtime API, there is one context per device per host process. When using the driver API, you create your own contexts. There is no per-stream context.
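To make the point about contexts concrete, here is a minimal sketch that mixes the runtime and driver APIs. It assumes device 0 exists, a toolkit where cuStreamGetCtx takes the two-argument form used below, and linking against the driver library (the build line in the comment is an assumption too). It checks whether the context the runtime uses is the device's primary context, and whether two runtime streams report the same context.

```cuda
// Assumed build line: nvcc context_demo.cu -o context_demo -lcuda
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
    // The runtime creates its context lazily; cudaFree(0) forces that to happen.
    if (cudaSetDevice(0) != cudaSuccess || cudaFree(0) != cudaSuccess) {
        std::printf("No usable CUDA device.\n");
        return 1;
    }

    // Context the runtime made current on this host thread.
    CUcontext runtimeCtx = nullptr;
    if (cuCtxGetCurrent(&runtimeCtx) != CUDA_SUCCESS) return 1;

    // Primary context of device 0, obtained through the driver API.
    CUdevice dev;
    CUcontext primaryCtx = nullptr;
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) return 1;
    if (cuDevicePrimaryCtxRetain(&primaryCtx, dev) != CUDA_SUCCESS) return 1;

    // Two streams created through the runtime API.
    cudaStream_t s1, s2;
    if (cudaStreamCreate(&s1) != cudaSuccess) return 1;
    if (cudaStreamCreate(&s2) != cudaSuccess) return 1;

    // Ask the driver which context each stream belongs to.
    CUcontext ctx1 = nullptr, ctx2 = nullptr;
    if (cuStreamGetCtx(s1, &ctx1) != CUDA_SUCCESS) return 1;
    if (cuStreamGetCtx(s2, &ctx2) != CUDA_SUCCESS) return 1;

    std::printf("runtime context is the primary context: %s\n",
                runtimeCtx == primaryCtx ? "yes" : "no");
    std::printf("both streams share one context: %s\n",
                ctx1 == ctx2 ? "yes" : "no");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cuDevicePrimaryCtxRelease(dev);  // balances the retain above
    return 0;
}
```

This only inspects handles that the public APIs expose. It says nothing about how the driver or hardware represents a context internally.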

For questions 2 and 3, as far as I know, that information is not published by NVIDIA. There is a small amount of information on device connections (see, for example, the CUDA_DEVICE_MAX_CONNECTIONS environment variable), which do have a relationship with streams. You can search for mentions of this topic, but it doesn't address all of your questions.
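Even though the mechanism is unpublished, the behavior the API documents can be observed: operations issued to the same stream run in issue order, whether they are copies or kernels. The following is a minimal sketch of that, assuming a single device and pinned host memory. The kernel name add_one and the buffer size are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element once.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    int *h = nullptr, *d = nullptr;
    cudaStream_t s;
    // Pinned host memory so the async copies can be truly asynchronous.
    if (cudaMallocHost(&h, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return 1;
    if (cudaStreamCreate(&s) != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i) h[i] = i;

    // Three operations in ONE stream: copy in, kernel, copy out.
    // No explicit synchronization between them; stream order is relied on.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    add_one<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);

    // Wait for everything queued in the stream to finish.
    if (cudaStreamSynchronize(s) != cudaSuccess) return 1;

    // If the stream preserved issue order, every element was incremented once.
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (h[i] != i + 1) ++errors;
    }
    std::printf("mismatches: %d\n", errors);

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return errors == 0 ? 0 : 1;
}
```

Issuing the same three operations into several different streams is how you would allow overlap between them, but whether they actually run concurrently depends on the device and is not guaranteed.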

Thanks for your reply; it gave me some directions to learn more!