Does cudnnCreate() call create multiple streams internally?

I am writing a simple multi-stream CUDA application. The following is the part of the code where I create the CUDA streams, the cuBLAS handles and the cuDNN handles:


const int num_streams = 1;  // const so the array sizes below are compile-time constants

cudaStream_t streams[num_streams];
cudnnHandle_t mCudnnHandle[num_streams];
cublasHandle_t mCublasHandle[num_streams];

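for (int ii = 0; ii < num_streams; ii++) {
    cudaStreamCreateWithFlags(&streams[ii], cudaStreamNonBlocking);
    // Handles must be created before a stream can be bound to them.
    cublasCreate(&mCublasHandle[ii]);
    cudnnCreate(&mCudnnHandle[ii]);
    cublasSetStream(mCublasHandle[ii], streams[ii]);
    cudnnSetStream(mCudnnHandle[ii], streams[ii]);
}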

Now, my stream count is 1, but when I profile the executable of the above application with the NVIDIA Visual Profiler I get the following:

For every stream I create, the profiler shows 4 additional streams. When I tested with num_streams = 8, it showed 40 streams. This raised the following questions:

  1. Does cuDNN create streams internally? If yes, why?
  2. If it creates streams implicitly, how can I make use of them?
  3. In that case, does creating streams explicitly make any sense?

Hi @sandip.ganage,
Apologies for the miss,
Yes, it does: cuDNN can use internal streams for its work and synchronize them with the stream you pass in. From memory, one example is the FFT-based convolution algorithms, where many GEMMs can run at the same time on multiple streams.