I am curious how CUDA streams are implemented at the system level. Specifically, does creating a new stream create a new host thread in the calling process?
I ask because I observed additional threads in my application that I did not create. For instance, when I ran the command “pstree”, I got the following process tree:
smaq_server─┬─6*[smaq_server───2*[{smaq_server}]]
            └─7*[{smaq_server}]
The tree shows that each child process has two threads in addition to its main thread. My code does not create any threads explicitly. My only explanation is that cudaStreamCreate creates a new host thread to manage the stream. Is this explanation accurate?