cuDNN with CUDA streams

Dear Developers,

I am an NVIDIA developer working on parallelizing the computation of deep neural networks with cuDNN. I launch two cudnnConvolutionForward calls in two different CUDA streams with no data dependence between them (the GPU is a Tesla P100, the CUDA version is 7.5, the cuDNN version is 5.0.5, and the OS is Ubuntu 14.04). I use Nsight Eclipse Edition to develop and profile the program. I expect the GPU to process these two functions fully in parallel.
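For reference, the launch pattern I am describing looks roughly like the following sketch. It is not my full program: handle and descriptor creation, algorithm selection, and error checking are elided, and the descriptor and buffer names (xDesc, wDesc, workspace1, and so on) are placeholders. Since cuDNN issues work on the stream bound to its handle via cudnnSetStream, the handle's stream is switched between the two launches, and each launch gets its own workspace.

```cpp
// Sketch only: two independent cudnnConvolutionForward calls on two streams.
// All descriptors/buffers are assumed to be created and filled elsewhere.
#include <cudnn.h>
#include <cuda_runtime.h>

void launch_two_convolutions(cudnnHandle_t handle,
                             cudnnTensorDescriptor_t xDesc, const void *x1, const void *x2,
                             cudnnFilterDescriptor_t wDesc, const void *w,
                             cudnnConvolutionDescriptor_t convDesc,
                             cudnnConvolutionFwdAlgo_t algo,
                             void *workspace1, void *workspace2, size_t wsBytes,
                             cudnnTensorDescriptor_t yDesc, void *y1, void *y2)
{
    const float alpha = 1.0f, beta = 0.0f;

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // cuDNN enqueues work on the stream currently bound to the handle,
    // so the stream must be switched between the two launches.
    cudnnSetStream(handle, s1);
    cudnnConvolutionForward(handle, &alpha, xDesc, x1, wDesc, w, convDesc,
                            algo, workspace1, wsBytes, &beta, yDesc, y1);

    cudnnSetStream(handle, s2);
    cudnnConvolutionForward(handle, &alpha, xDesc, x2, wDesc, w, convDesc,
                            algo, workspace2, wsBytes, &beta, yDesc, y2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```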

Condition 1: However, what I see in Nsight is that these two functions are not processed fully in parallel; the overlap rate is small.

Condition 2: When I reduce the volume of input data (the computation data), the overlap rate improves.

From these two conditions, I conclude that the limited resources of the GPU are responsible for the small overlap rate between the two functions.

I have some questions I need your help with.
(1) Is the above conclusion correct?
(2) If the conclusion is right, can you tell me which GPU resources limit the parallelism?
(3) When I watch the GPU's resources with the ‘nvidia-smi’ command while the program is running, it shows that ‘Volatile GPU-Util’ is below 70% even in Condition 1. This makes me doubt the conclusion above. Are there other factors (such as constraints coming from cuDNN or CUDA streams)?
(4) Can you give me some reference materials about cuDNN or CUDA streams?

Thank you very much!
Best wishes to you!

cross posting:

https://stackoverflow.com/questions/49625151/how-does-the-gpu-process-perfectly-two-cudnn-functions-launch-in-two-cuda-stream

notes:

  1. Yes
  2. Possibly the number of blocks launched per kernel. Reducing the volume of input data may reduce the kernel size (number of blocks), leaving SMs free for the second kernel.
  3. You’re misinterpreting that utilization number, see here: https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation/40938696#40938696
  4. For CUDA streams, read the programming guide:
    http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams
    For cuDNN, read the user’s guide.
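To illustrate note 2 above: two kernels in different streams can only overlap when the first does not already occupy every SM. The sketch below (a plain CUDA example, not cuDNN; the kernel and grid sizes are made up for illustration) launches the same kernel pair twice. With large grids the first kernel saturates the GPU and the second queues behind it; with small grids spare SMs allow concurrent execution, which should be visible as overlap in a profiler timeline such as Nsight.

```cpp
// Illustration of concurrent kernel execution limits (assumed sizes).
#include <cuda_runtime.h>

__global__ void busy(float *data, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[idx];
    for (int i = 0; i < iters; ++i)   // artificial work to lengthen the kernel
        v = v * 1.0000001f + 0.0000001f;
    data[idx] = v;
}

int main()
{
    const int threads = 256, smallGrid = 4, largeGrid = 4096;
    float *d1, *d2;
    cudaMalloc(&d1, largeGrid * threads * sizeof(float));
    cudaMalloc(&d2, largeGrid * threads * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Large grids: the first kernel fills the SMs, so little or no overlap.
    busy<<<largeGrid, threads, 0, s1>>>(d1, 1 << 16);
    busy<<<largeGrid, threads, 0, s2>>>(d2, 1 << 16);
    cudaDeviceSynchronize();

    // Small grids: both kernels fit on the GPU at once and can overlap.
    busy<<<smallGrid, threads, 0, s1>>>(d1, 1 << 16);
    busy<<<smallGrid, threads, 0, s2>>>(d2, 1 << 16);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d1);
    cudaFree(d2);
    return 0;
}
```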

We created a new “Deep Learning Training and Inference” section in Devtalk to improve the experience for deep learning, accelerated computing, and HPC users:
https://devtalk.nvidia.com/default/board/301/deep-learning-training-and-inference-/

We are moving active deep learning threads to the new section.

URLs for topics will not change with the re-categorization, so your bookmarks and links will continue to work as before.

-Siddharth