How are CUDA streams implemented?

I am curious how CUDA streams are implemented at the system level. Specifically, I am wondering whether creating a new stream means a new host thread is created in the calling process.

I am asking because I observed additional threads in my application that I did not create. For instance, when I ran the command “pstree”, I got the following process tree:

smaq_server─┬─6*[smaq_server───2*[{smaq_server}]]
            └─7*[{smaq_server}]

It is clear that each child process has two threads. I did not manually create any threads in my code. My only explanation is that cudaStreamCreate created a new thread to manage that particular stream. I am wondering whether this explanation is accurate.
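
For reference, this is the kind of minimal check I have in mind (a hypothetical sketch, not the actual smaq_server code): it prints the process's thread count, read from /proc/self/status on Linux, before and after context creation and cudaStreamCreate.

```cpp
// check_stream_threads.cu -- hypothetical repro, not the original smaq_server code.
// Prints the process's thread count before and after CUDA initialization
// and stream creation, using the "Threads:" field of /proc/self/status.
#include <cstdio>
#include <cuda_runtime.h>

// Read the "Threads:" field from /proc/self/status (Linux only).
static int threadCount() {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    int n = -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Threads: %d", &n) == 1) break;
    }
    fclose(f);
    return n;
}

int main() {
    printf("threads before CUDA init: %d\n", threadCount());

    cudaFree(0);  // force context creation; the driver may spawn its own threads here
    printf("threads after context creation: %d\n", threadCount());

    cudaStream_t streams[8];
    for (int i = 0; i < 8; ++i)
        cudaStreamCreate(&streams[i]);
    printf("threads after creating 8 streams: %d\n", threadCount());

    for (int i = 0; i < 8; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```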

A CUDA stream is mostly a device-side activity. It coordinates work on the device. It should not require a host thread per stream in all cases.
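
As a rough illustration (a sketch I put together, not anything from your application), work issued into a stream executes in issue order on the device, while the host thread that issued it simply moves on; no extra host thread is needed for this ordering.

```cpp
// stream_order.cu -- sketch: the host thread only enqueues work;
// the device runs the enqueued work back to back in stream order.
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both launches are asynchronous with respect to the host thread;
    // the device executes them one after the other in stream order.
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 2.0f, n);
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 0.5f, n);

    cudaStreamSynchronize(stream);  // the host waits here; no extra host thread involved
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```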

The CUDA driver will create additional threads from time to time, as needed. For example, a stream that uses a host-code callback (e.g. via cudaStreamAddCallback or cudaLaunchHostFunc) will require an extra thread beyond any that your application creates. If you have multiple streams each doing a callback, I don't know whether that creates one thread per stream or not.
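
Something along these lines (a sketch assuming cudaLaunchHostFunc, available in CUDA 10 and later, on Linux) shows the effect: the host function enqueued into the stream is invoked on a driver-managed thread, not on any thread your application created.

```cpp
// callback_thread.cu -- sketch: a host callback in a stream runs on a driver thread.
#include <cstdio>
#include <cuda_runtime.h>
#include <unistd.h>
#include <sys/syscall.h>

__global__ void busyKernel() {
    // spin briefly so the callback clearly runs after device work
    for (volatile int i = 0; i < 1000000; ++i) {}
}

// Host function enqueued into the stream; the driver invokes it
// after all preceding work in the stream has completed.
void myHostFn(void *userData) {
    printf("callback running on thread id %ld\n", (long)syscall(SYS_gettid));
}

int main() {
    printf("main thread id %ld\n", (long)syscall(SYS_gettid));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    busyKernel<<<1, 1, 0, stream>>>();
    cudaLaunchHostFunc(stream, myHostFn, nullptr);  // driver schedules this on its own thread

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```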

I don’t believe these details are specified anywhere.