Extra CPU threads created when calling CUDA runtime APIs

While calling the CUDA runtime API, I noticed something unexpected. An extra CPU thread is created when the process first calls cudaGetDeviceCount, and two more CPU threads are created when cudaStreamCreateWithFlags is first called. What do these extra CPU threads do, and what impact do they have?

The CUDA runtime creates CPU threads for its own purposes, and not all of those purposes are documented. The ones I am aware of are:

  1. Facilitating cudaMemcpy operations, in some cases, when the source or destination points to pageable host memory.
  2. Providing a vehicle for executing a callback scheduled with cudaLaunchHostFunc.

There may be other uses I am not aware of.
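
One way to check the second item on your own machine is to compare the ID of the thread that runs a cudaLaunchHostFunc callback with the ID of the thread that scheduled it. The following is only a sketch under some assumptions: Linux (it uses the gettid system call), compilation with nvcc so that `__CUDACC__` is defined, and the helper names `current_tid`, `record_tid` and `callback_runs_on_other_thread` are made up for illustration. Built with a plain C++ compiler, the CUDA part is skipped and the function simply returns false.

```cpp
#include <cstdio>
#include <sys/syscall.h>
#include <unistd.h>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Linux kernel thread ID of the calling thread (Linux-only assumption).
long current_tid() { return static_cast<long>(syscall(SYS_gettid)); }

#ifdef __CUDACC__
// Host function with the signature cudaLaunchHostFunc expects.
// It stores the ID of whichever thread the runtime uses to run it.
void CUDART_CB record_tid(void* user_data) {
    *static_cast<long*>(user_data) = current_tid();
}
#endif

// Returns true when the callback ran on a thread other than the caller's.
// Returns false when built without nvcc or when any CUDA call fails.
bool callback_runs_on_other_thread() {
#ifdef __CUDACC__
    long callback_tid = -1;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    // Schedule the host function, then wait until the stream has run it.
    bool ok = cudaLaunchHostFunc(stream, record_tid, &callback_tid) == cudaSuccess
           && cudaStreamSynchronize(stream) == cudaSuccess;
    cudaStreamDestroy(stream);

    std::printf("caller tid = %ld, callback tid = %ld\n",
                current_tid(), callback_tid);
    return ok && callback_tid != -1 && callback_tid != current_tid();
#else
    return false;  // no CUDA runtime available in this build
#endif
}
```

Calling `callback_runs_on_other_thread()` from `main` in a `.cu` file prints both IDs, so you can see for yourself which thread the runtime picked on your setup.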

I know of no way to change how the CUDA runtime spins up CPU threads according to its needs, nor do I know of any detailed documentation explaining what the threads are used for or how many to expect.
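
Since the thread counts are not documented, the practical option is to measure them. Here is a minimal sketch, assuming Linux, where each thread of a process shows up as one entry under /proc/self/task. The helpers `count_threads`, `report` and `observe_thread_growth` are hypothetical names for this example. The CUDA calls are compiled only under nvcc (which defines `__CUDACC__`), and the numbers you see may well differ between CUDA versions, drivers and platforms.

```cpp
#include <cstdio>
#include <dirent.h>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Counts the threads of the current process by listing /proc/self/task.
// Returns -1 if the directory cannot be opened (for example, not on Linux).
int count_threads() {
    DIR* dir = opendir("/proc/self/task");
    if (!dir) return -1;
    int n = 0;
    while (dirent* entry = readdir(dir)) {
        if (entry->d_name[0] != '.') ++n;  // skip "." and ".."
    }
    closedir(dir);
    return n;
}

// Prints the current thread count next to a label.
void report(const char* label) {
    std::printf("%-34s threads = %d\n", label, count_threads());
}

// Prints the thread count before and after the two calls from the question.
void observe_thread_growth() {
    report("at start");
#ifdef __CUDACC__
    int devices = 0;
    cudaGetDeviceCount(&devices);
    report("after cudaGetDeviceCount");

    cudaStream_t stream;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) == cudaSuccess) {
        report("after cudaStreamCreateWithFlags");
        cudaStreamDestroy(stream);
    }
#else
    std::printf("built without nvcc, CUDA calls skipped\n");
#endif
}
```

Call `observe_thread_growth()` from `main` in a `.cu` file built with nvcc. Tools such as `ps -L` or `top -H` show the same threads from outside the process if you prefer not to touch the code.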