I am an NVIDIA developer working on parallelizing deep neural network computation with cuDNN. I launch two calls to cudnnConvolutionForward in two different CUDA streams with no data dependence between them (the GPU is a Tesla P100, the CUDA version is 7.5, the cuDNN version is 5.0.5, and the OS is Ubuntu 14.04). I use Nsight Eclipse Edition to develop and profile the program. I hope the GPU can process these two functions fully in parallel.
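My launch code looks roughly like the sketch below (descriptor/tensor setup, algorithm selection, and error checking are omitted, and all names such as handle1, xDesc1, workspace1 are placeholders, not my exact code):

```cpp
// Sketch only: assumes handles, descriptors, device buffers, and
// workspaces were created earlier; all identifiers are placeholders.
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Each launch uses its own cuDNN handle bound to its own stream,
// so the two convolutions are issued to different streams.
cudnnSetStream(handle1, s1);
cudnnSetStream(handle2, s2);

float alpha = 1.0f, beta = 0.0f;
cudnnConvolutionForward(handle1, &alpha, xDesc1, x1, wDesc1, w1,
                        convDesc1, algo1, workspace1, wsSize1,
                        &beta, yDesc1, y1);
cudnnConvolutionForward(handle2, &alpha, xDesc2, x2, wDesc2, w2,
                        convDesc2, algo2, workspace2, wsSize2,
                        &beta, yDesc2, y2);

cudaDeviceSynchronize();  // wait for both streams before profiling ends
```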
Condition 1: However, what I see in Nsight is that these two functions are not processed fully in parallel; the overlap rate is small.
Condition 2: When I reduce the volume of the input (computation) data, the overlap rate improves.
From the two conditions above, I conclude that the limited resources of the GPU are responsible for the small overlap rate between the two functions.
I have some questions and would appreciate your help:
(1) Is the above conclusion correct?
(2) If the conclusion is right, can you tell me which GPU resources limit the parallelism?
(3) When I watch the GPU resources with the 'nvidia-smi' command while the program is running, it shows that 'Volatile GPU-Util' is less than 70% even in Condition 1. This makes me doubt the conclusion above. Are there other factors (such as constraints coming from cuDNN or CUDA streams)?
(4) Can you recommend some reference material on cuDNN or CUDA streams?
Thank you very much!
Best wishes to you!