Profiling and Optimizing Deep Neural Networks with DLProf and PyProf

jwitsoe · September 28, 2020, 6:33pm

Originally published at: https://developer.nvidia.com/blog/profiling-and-optimizing-deep-neural-networks-with-dlprof-and-pyprof/

Software profiling is key for achieving the best performance on a system and that’s true for the data science and machine learning applications as well. In the era of GPU-accelerated deep learning, when profiling deep neural networks, it is important to understand CPU, GPU, and even memory bottlenecks, which could cause slowdowns in training or…

ecan · September 28, 2020, 11:07pm

I have been using these profiling tools frequently for my deep learning models and taking notes about my experiences along the way. This blog post is a good summary of those experiences. Please feel free to ask any questions you might have and/or share your feedback and comments with us.

yashasvi1997 · January 14, 2021, 11:25pm

Dear @ecan & @jwitsoe, thank you for your amazing post! I have just started with DL profiling and have the following question.
I am using nvidia-smi to observe distributed model training performance, and I seem to be getting conflicting results. As shown in the picture below, both GPUs have their memory fully occupied, but GPU-Util tells a different story: GPU[0] has very low utilization (around 0%, fluctuating over a wide range) compared to GPU[1] (consistently around 100%). Pwr:Usage/Cap shows the opposite trend: GPU[0] has high power usage (120-240 W) compared to GPU[1] (consistently ~110-120 W).

As mentioned in the blog post, Pwr:Usage and Volatile GPU-Util should be correlated, but in this case they show opposite trends. So my question is: which metric should I rely on when monitoring my model’s performance, and is there any way to improve these numbers so that both GPUs are used at full capacity during distributed training?

System/Program info:
Dell Precision Tower 7920 (Single Machine)
Multiple GPUs: 2x Nvidia RTX 2080 Ti
Python 3.8
Pytorch: 1.7.1
Model: Self-Attention based visual model
Using mixed precision distributed data training along with PyTorch Autocast

Kindly let me know if any other information is needed. Thanks in advance for your help.

ecan · January 15, 2021, 7:56pm

Hi @yashasvi1997,
Thanks for the question and reading the post :).

I would recommend upgrading the CUDA version as well as the driver version, to check whether you still get these results with the upgraded drivers.

The second thing to try is the PyTorch container from NGC. NGC containers have the best settings for NVIDIA hardware. If you are getting the similar

If above does not help, I have another thought:
I believe the first GPU does the reduce operation. Possibly GPU1 does its processing (as does GPU0), and then GPU0 waits for some numbers from GPU1 (e.g., gradients).

Given that you now have a slightly more complicated scenario and our naive initial weapon didn’t solve it, I would recommend that you start experimenting with DLProf. I would even go a bit further and try Nsight for a deeper analysis.
Even though nvidia-smi is a great tool, we might need more analysis with other NVIDIA tools.
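
Before moving to the heavier tools, it can help to log utilization and power per GPU over time rather than reading single nvidia-smi snapshots. The following is a minimal sketch, not something from the blog post: it assumes nvidia-smi is on the PATH and uses its `--query-gpu` CSV output. The function names (`parse_gpu_csv`, `sample_gpus`, `summarize`) are hypothetical helpers made up for illustration.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "utilization.gpu", "power.draw", "memory.used"]


def _to_float(field):
    """Convert a CSV field to float; return None for values such as [N/A]."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        rows.append({
            "index": int(rec[0]),
            "util_pct": _to_float(rec[1]),
            "power_w": _to_float(rec[2]),
            "mem_mib": _to_float(rec[3]),
        })
    return rows


def sample_gpus():
    """Take one sample from nvidia-smi (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def summarize(samples):
    """Average utilization and power per GPU index over many samples."""
    acc = {}
    for rows in samples:
        for row in rows:
            slot = acc.setdefault(row["index"], {"util": [], "power": []})
            if row["util_pct"] is not None:
                slot["util"].append(row["util_pct"])
            if row["power_w"] is not None:
                slot["power"].append(row["power_w"])

    def mean(values):
        return sum(values) / len(values) if values else None

    return {
        idx: {"util_pct": mean(s["util"]), "power_w": mean(s["power"])}
        for idx, s in acc.items()
    }
```

One way to use it is to call `sample_gpus()` once per second during training, collect the results in a list, and pass that list to `summarize()`. Averages over a few minutes make it easier to see whether one GPU is really idle or only idle at the moments nvidia-smi happens to sample it.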

Please let me know if any of these help.

Thanks

yashasvi1997 · January 16, 2021, 12:01am

Thank you @ecan for your quick response!
Sure, I’ll look into these options. My CUDA version is 10.2 and the NVIDIA driver version is 440.33.01.

I think using an NVIDIA NGC container seems to be the most optimized option. Just a small question though: does the GeForce RTX 2080 Ti support running NGC containers? A quick search shows that support is available for DGX/Titan/Quadro PCs only. (I’ll definitely search more on this and report back if the problem persists.)

Thanks again for your support.

ecan · January 19, 2021, 7:22pm

@yashasvi1997 I still recommend updating your driver and CUDA. NGC should work with your GPU, although I haven’t tested it myself.

mohsin.shaikh · January 24, 2021, 12:53pm

Hi,
Is there a possibility to profile the NVLink traffic?

rajana · January 25, 2021, 7:04pm

Hi @mohsin.shaikh, you can monitor NVLink traffic using the nvidia-smi command-line tool in a terminal,
e.g., while your workload/model is running:
user@dgx:~$ nvidia-smi nvlink -h
nvlink -- Display NvLink information.
Usage: nvidia-smi nvlink [options]
Options include:
[-h | --help]: Display help information
[-i | --id]: Enumeration index, PCI bus ID or UUID.
[-l | --link]: Limit a command to a specific link. Without this flag, all link information is displayed.
[-s | --status]: Display link state (active/inactive).
[-c | --capabilities]: Display link capabilities.
[-p | --pcibusid]: Display remote node PCI bus ID for a link.
[-R | --remotelinkinfo]: Display remote device PCI bus ID and NvLink ID for a link.
[-sc | --setcontrol]: Setting counter control is deprecated!
[-gc | --getcontrol]: Getting counter control is deprecated!
[-g | --getcounters]: Getting counters using option -g is deprecated.
Please use option -gt/--getthroughput instead.
[-r | --resetcounters]: Resetting counters is deprecated!
[-e | --errorcounters]: Display error counters for a link.
[-ec | --crcerrorcounters]: Display per-lane CRC error counters for a link.
[-re | --reseterrorcounters]: Reset all error counters to zero.
[-gt | --getthroughput]: Display link throughput counters for specified counter type
The arguments consist of character string representing the type of traffic counted:
d: Display tx and rx data payload in KiB
r: Display tx and rx data payload and protocol overhead in KiB if supported

user@dgx:~$ nvidia-smi nvlink -gt d
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-a01f73b9-e8c9-89e5-6c19-1beaa6d64907)
Link 0: Data Tx: 0 KiB
Link 0: Data Rx: 0 KiB
Link 1: Data Tx: 0 KiB
Link 1: Data Rx: 0 KiB
Link 2: Data Tx: 0 KiB
Link 2: Data Rx: 0 KiB
Link 3: Data Tx: 0 KiB
Link 3: Data Rx: 0 KiB
Link 4: Data Tx: 0 KiB
Link 4: Data Rx: 0 KiB
Link 5: Data Tx: 0 KiB
Link 5: Data Rx: 0 KiB
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-bb6446d0-c867-43e5-1eae-7ced263f2372)
Link 0: Data Tx: 0 KiB
Link 0: Data Rx: 0 KiB
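
The counters printed by `nvidia-smi nvlink -gt d` are easier to read as per-GPU totals, and as the difference between two readings taken before and after a workload. Here is a minimal sketch of that, assuming the output keeps the `GPU n:` / `Link n: Data Tx: x KiB` layout shown above. `parse_nvlink_throughput` and `counter_delta` are hypothetical helper names, not part of any NVIDIA tool.

```python
import re
from collections import defaultdict

# Patterns for the two kinds of lines seen in the output above.
GPU_RE = re.compile(r"^GPU (\d+):")
LINK_RE = re.compile(r"^Link (\d+): Data (Tx|Rx): (\d+) KiB")


def parse_nvlink_throughput(text):
    """Sum Tx and Rx KiB over all links, keyed by GPU index."""
    totals = defaultdict(lambda: {"Tx": 0, "Rx": 0})
    gpu = None
    for raw in text.splitlines():
        line = raw.strip()  # link lines may be indented in real output
        gpu_match = GPU_RE.match(line)
        if gpu_match:
            gpu = int(gpu_match.group(1))
            totals[gpu]  # create the entry even if no link lines follow
            continue
        link_match = LINK_RE.match(line)
        if link_match and gpu is not None:
            direction = link_match.group(2)
            totals[gpu][direction] += int(link_match.group(3))
    return {idx: dict(counts) for idx, counts in totals.items()}


def counter_delta(before, after):
    """Traffic between two readings, per GPU and direction, in KiB."""
    delta = {}
    for gpu, counts in after.items():
        prev = before.get(gpu, {})
        delta[gpu] = {k: v - prev.get(k, 0) for k, v in counts.items()}
    return delta
```

Capture the command output once before the workload and once after (for example with `subprocess.run`), parse both, and pass them to `counter_delta`. All-zero counters, as in the listing above, would suggest that no NVLink traffic had been counted when the reading was taken.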

maoares088ks · July 19, 2021, 1:34pm

Hi @ecan
I’m trying to use DLProf to profile my TensorFlow script, and I want to evaluate the training and eval steps separately.
I found the <nsys_profile_range> parameter in section 4.6 of the DLProf user guide, but how can I set the start and stop points just for the training section?
NGC container: nvcr.io/nvidia/tensorflow:21.06-tf1-py3

Thanks for your help

ecan · July 19, 2021, 2:35pm

Hi,
How about disabling eval and profiling training on its own first? Then you can profile eval with training disabled. Would that work?
Thanks

maoares088ks · July 20, 2021, 1:13am

Thanks for your help, @ecan!
Yes, it works when the training step is disabled.
By the way, could I put profile.start() & profile.end() around sess.run to profile each part separately, or is that only available for PyTorch with DLProf?
Thanks

ecan · July 26, 2021, 4:39pm

That is in PyTorch. I believe TF has something similar, but I can’t remember it right now. I will update here if I find it. Thanks
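
For the PyTorch side, the idea of bracketing only some iterations with start/stop calls can be sketched as below. This is an illustration under assumptions, not the DLProf-documented workflow: `torch.cuda.profiler.start()` and `torch.cuda.profiler.stop()` are real PyTorch calls, but whether a given profiler honors them depends on how its capture range is configured (for Nsight Systems, for example, `--capture-range=cudaProfilerApi`). `run_with_profiled_window` and `cuda_profiler_hooks` are hypothetical helper names.

```python
def cuda_profiler_hooks():
    """Return (start, stop) from torch.cuda.profiler, or no-ops without torch."""
    try:
        import torch.cuda.profiler as cuda_profiler
        return cuda_profiler.start, cuda_profiler.stop
    except ImportError:
        # Assumption: fall back to no-ops so the loop still runs without PyTorch.
        def noop():
            return None
        return noop, noop


def run_with_profiled_window(step_fn, num_steps, first, last, start, stop):
    """Run step_fn(i) for each step; call start/stop around steps first..last."""
    active = False
    try:
        for i in range(num_steps):
            if i == first:
                start()
                active = True
            step_fn(i)
            if i == last and active:
                stop()
                active = False
    finally:
        # Close the range if the loop ended or raised inside the window.
        if active:
            stop()
```

A training loop would pass its own per-step function, for example `run_with_profiled_window(train_step, 1000, 200, 209, *cuda_profiler_hooks())`, where `train_step` is a placeholder for the code that runs one iteration. The same helper can wrap an eval loop separately, so each phase gets its own profiled window.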

maoares088ks · August 5, 2021, 8:38am

Hi @ecan,
I am now trying to batch-profile inference for different networks with trtexec in a single shell script, such as:

dlprof --mode=tensorrt  --iter_start=200 --iter_stop=209 --reports=all --output_path=./dlprof/1  trtexec  --loadEngine=1.trt  
dlprof --mode=tensorrt  --iter_start=200 --iter_stop=209 --reports=all --output_path=./dlprof/2  trtexec  --loadEngine=2.trt
：
dlprof --mode=tensorrt  --iter_start=200 --iter_stop=209 --reports=all --output_path=./dlprof/xx  trtexec  --loadEngine=xx.trt

xx.trt is built from an ONNX file by TensorRT.

However, a “database is locked” error is reported at random for some of the networks:

[DLProf-01:26:26] DLprof completed system call successfully
[DLProf-01:27:59] Error Occurred:
[DLProf-01:27:59] database is locked

If possible, could you tell me the reason and how to avoid it?

ecan · August 11, 2021, 10:09pm

Can you please check how many files/folders one dlprof call creates? dlprof uses default names for a range of output files/folders, and this error looks to me like a file from the previous call is still locked while another call is trying to write to it. Thanks.
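
If shared default file names are indeed the cause, one thing to try is running each dlprof call in its own working directory, strictly one after another. The sketch below is untested against DLProf and rests on that hypothesis. It reuses only the flags from the commands posted above; `build_dlprof_command` and `profile_engines` are hypothetical helper names, and the `dlprof_runs` directory is an arbitrary choice.

```python
import subprocess
from pathlib import Path


def build_dlprof_command(engine, output_dir, iter_start=200, iter_stop=209):
    """Assemble the same dlprof + trtexec command line used in the script above."""
    return [
        "dlprof",
        "--mode=tensorrt",
        f"--iter_start={iter_start}",
        f"--iter_stop={iter_stop}",
        "--reports=all",
        f"--output_path={output_dir}",
        "trtexec",
        f"--loadEngine={engine}",
    ]


def profile_engines(engines, root="dlprof_runs"):
    """Profile each engine sequentially, each in a separate working directory."""
    root_dir = Path(root).resolve()
    for engine in engines:
        engine_path = Path(engine).resolve()  # absolute, since cwd changes below
        workdir = root_dir / engine_path.stem
        workdir.mkdir(parents=True, exist_ok=True)
        cmd = build_dlprof_command(engine_path, workdir / "reports")
        # subprocess.run blocks until dlprof exits, so runs never overlap;
        # cwd keeps any default-named intermediate files apart per engine.
        subprocess.run(cmd, cwd=workdir, check=True)
```

Calling `profile_engines(["1.trt", "2.trt"])` would then leave each run’s files under `dlprof_runs/1` and `dlprof_runs/2`. If the error persists even with separate directories, the lock probably comes from somewhere other than colliding default names.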
