I am writing a simple multi-stream CUDA application. The following is the part of the code where I create the CUDA streams, cuBLAS handles, and cuDNN handles:
    cudaSetDevice(0);
    // const so the array sizes below are compile-time constants (valid C++)
    const int num_streams = 1;
    cudaStream_t streams[num_streams];
    cudnnHandle_t mCudnnHandle[num_streams];
    cublasHandle_t mCublasHandle[num_streams];
    for (int ii = 0; ii < num_streams; ii++) {
        // One non-blocking stream per index
        cudaStreamCreateWithFlags(&streams[ii], cudaStreamNonBlocking);
        // Bind a cuBLAS handle to that stream
        cublasCreate(&mCublasHandle[ii]);
        cublasSetStream(mCublasHandle[ii], streams[ii]);
        // Bind a cuDNN handle to the same stream
        cudnnCreate(&mCudnnHandle[ii]);
        cudnnSetStream(mCudnnHandle[ii], streams[ii]);
    }
My stream count here is 1, but when I profile the executable with the NVIDIA Visual Profiler, it shows 4 additional streams for every stream I create. I also tested it with num_streams = 8, and the profiler showed 40 streams. This raises the following questions:
- Does cuDNN internally create streams? If yes, why?
- If it implicitly creates streams, how can I make use of them?
- In that case, does explicitly creating streams make any sense?
