Profiling and Optimizing Deep Neural Networks with DLProf and PyProf

Originally published at:

Software profiling is key for achieving the best performance on a system and that’s true for the data science and machine learning applications as well. In the era of GPU-accelerated deep learning, when profiling deep neural networks, it is important to understand CPU, GPU, and even memory bottlenecks, which could cause slowdowns in training or…

We have been using these profiling tools for our deep learning models frequently and have been taking notes about our experiences along the way. This blog is a good summary of those experiences. Please feel free to ask any questions you might have and/or share your feedback/comments with us.

Dear @ecan & @jwitsoe, thank you for your amazing post! I have just started with DL profiling and have the following question.
I am using nvidia-smi to observe distributed model training performance, and I seem to be getting conflicting results. As shown in the picture below, both GPUs have their memory fully occupied, but GPU-Util shows that GPU[0] has very low utilization (~0%, fluctuating over a wide range) compared to GPU[1] (consistently around 100%). Pwr:Usage/Cap shows the opposite trend: GPU[0] has high power usage (120-240 W) compared to GPU[1] (consistently ~110-120 W).

As mentioned in the blog post, Pwr:Usage and Volatile GPU-Util should be correlated, but in this case they show opposite trends. So my question is: which of them should I rely on when monitoring my model’s performance, and is there any way to improve these numbers so that both GPUs are used at full capacity during distributed training?
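To compare the two columns over a whole training run rather than from a single screenshot, one option is to log them with nvidia-smi's query mode. Below is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names (parse_gpu_stats, query_gpu_stats) are made up for illustration and are not part of any NVIDIA tool.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
FIELDS = ["index", "utilization.gpu", "power.draw", "memory.used"]


def parse_gpu_stats(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not match the requested fields
        row = {}
        for name, value in zip(FIELDS, values):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # e.g. "[N/A]" when a reading is unavailable
        stats.append(row)
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling query_gpu_stats() in a loop with a short sleep while training runs gives a time series of utilization (%) and power draw (W) per GPU, which makes it easier to see whether the two really move in opposite directions.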

System/Program info:
Dell Precision Tower 7920 (Single Machine)
Multiple GPUs: 2x Nvidia RTX 2080 Ti
Python 3.8
Pytorch: 1.7.1
Model: Self-Attention based visual model
Using mixed precision distributed data training along with PyTorch Autocast

Kindly let me know if any other information is needed. Thanks in advance for your help.

Hi @yashasvi1997,
Thanks for the question and reading the post :).

I would recommend upgrading the CUDA version as well as the driver version, to check whether we still get these results with the upgraded drivers.

The second thing to try is the PyTorch container from NGC. NGC containers have the best settings for NVIDIA hardware. If you are getting the similar

If the above does not help, I have another thought:
I believe the first GPU does the reduce operation. Possibly GPU1 does its processing (as does GPU0), and then GPU0 waits for some numbers from GPU1 (e.g., gradients).

Given that you now have a somewhat more complicated scenario and our naive initial weapon didn’t solve it, I would recommend that you start experimenting with DLProf. I would even go a little further and try Nsight for a deeper analysis.
Even though nvidia-smi is a great tool, we might need a bit more analysis with other NVIDIA tools.

Please let me know if any of these help.


Thank you @ecan for your quick response!
Sure, I’ll look into these options. My CUDA version is 10.2 and the NVIDIA driver version is 440.33.01.

I think using the NVIDIA NGC container seems to be the best-optimized option. Just a small question though: does the GeForce RTX 2080 Ti support running NGC containers? A quick search suggests that support is available for DGX/Titan/Quadro PCs only. (I’ll definitely search more on this and report back if the problem persists.)

Thanks again for your support.

@yashasvi1997 I still recommend updating your driver and CUDA. NGC should work with your GPU - I haven’t tested myself though.

Is there a possibility to profile the NVLink traffic?

Hi @mohsin.shaikh, you can monitor NVLink traffic using the nvidia-smi command-line tool in a terminal,
e.g. while your workload/model is running:
user@dgx:~$ nvidia-smi nvlink -h
nvlink – Display NvLink information.
Usage: nvidia-smi nvlink [options]
Options include:
[-h | --help]: Display help information
[-i | --id]: Enumeration index, PCI bus ID or UUID.
[-l | --link]: Limit a command to a specific link. Without this flag, all link information is displayed.
[-s | --status]: Display link state (active/inactive).
[-c | --capabilities]: Display link capabilities.
[-p | --pcibusid]: Display remote node PCI bus ID for a link.
[-R | --remotelinkinfo]: Display remote device PCI bus ID and NvLink ID for a link.
[-sc | --setcontrol]: Setting counter control is deprecated!
[-gc | --getcontrol]: Getting counter control is deprecated!
[-g | --getcounters]: Getting counters using option -g is deprecated.
Please use option -gt/--getthroughput instead.
[-r | --resetcounters]: Resetting counters is deprecated!
[-e | --errorcounters]: Display error counters for a link.
[-ec | --crcerrorcounters]: Display per-lane CRC error counters for a link.
[-re | --reseterrorcounters]: Reset all error counters to zero.
[-gt | --getthroughput]: Display link throughput counters for specified counter type
The arguments consist of character string representing the type of traffic counted:
d: Display tx and rx data payload in KiB
r: Display tx and rx data payload and protocol overhead in KiB if supported

user@dgx:~$ nvidia-smi nvlink -gt d
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-a01f73b9-e8c9-89e5-6c19-1beaa6d64907)
Link 0: Data Tx: 0 KiB
Link 0: Data Rx: 0 KiB
Link 1: Data Tx: 0 KiB
Link 1: Data Rx: 0 KiB
Link 2: Data Tx: 0 KiB
Link 2: Data Rx: 0 KiB
Link 3: Data Tx: 0 KiB
Link 3: Data Rx: 0 KiB
Link 4: Data Tx: 0 KiB
Link 4: Data Rx: 0 KiB
Link 5: Data Tx: 0 KiB
Link 5: Data Rx: 0 KiB
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-bb6446d0-c867-43e5-1eae-7ced263f2372)
Link 0: Data Tx: 0 KiB
Link 0: Data Rx: 0 KiB
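
Since -gt d prints raw counters, traffic during a run is the difference between two readings. Here is a minimal sketch that parses output in the format shown above and subtracts two snapshots. The function names are made up for illustration, and the parsing assumes the "GPU n:" / "Link n: Data Tx: x KiB" layout from this post, which may differ across driver versions.

```python
import re
import subprocess

GPU_RE = re.compile(r"^\s*GPU (\d+):")
LINK_RE = re.compile(r"^\s*Link (\d+): Data (Tx|Rx): (\d+) KiB")


def parse_nvlink_throughput(text):
    """Map (gpu, link, 'tx' or 'rx') to the counter value in KiB."""
    counters = {}
    gpu = None
    for line in text.splitlines():
        m = GPU_RE.match(line)
        if m:
            gpu = int(m.group(1))  # following Link lines belong to this GPU
            continue
        m = LINK_RE.match(line)
        if m and gpu is not None:
            link = int(m.group(1))
            direction = m.group(2).lower()
            counters[(gpu, link, direction)] = int(m.group(3))
    return counters


def traffic_delta(before, after):
    """KiB transferred per counter between two snapshots."""
    return {key: after[key] - before.get(key, 0) for key in after}


def read_counters():
    """Take one snapshot by running `nvidia-smi nvlink -gt d`."""
    cmd = ["nvidia-smi", "nvlink", "-gt", "d"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_nvlink_throughput(out)
```

Taking one read_counters() snapshot before a training step and one after, then passing both to traffic_delta(), gives the KiB sent and received per link in that interval.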

Hi @ecan,
I’m trying to use DLProf to profile my TensorFlow script, and I want to evaluate the training and eval steps separately.
I found the nsys_profile_range parameter in section 4.6 of the DLProf user guide, but how can I set the start and stop points for just the training section?
ngc :

Thanks for your help

How about disabling eval and doing training first and profile it? Then you can profile the eval with no training. Would that work?

Thanks for your help @ecan!
Yes, that can work if the training step is disabled.
By the way, could I put profile.start() & profile.end() around each section to profile them separately, or is that only available for PyTorch with DLProf?

That is in PyTorch. I believe TF has something similar, but I can’t remember it right now. I will update here if I find it. Thanks.

Hi @ecan,
I am now trying to batch-profile the inference of different networks with trtexec in one shell script, such as:

dlprof --mode=tensorrt  --iter_start=200 --iter_stop=209 --reports=all --output_path=./dlprof/1  trtexec  --loadEngine=1.trt  
dlprof --mode=tensorrt  --iter_start=200 --iter_stop=209 --reports=all --output_path=./dlprof/2  trtexec  --loadEngine=2.trt
dlprof --mode=tensorrt  --iter_start=200 --iter_stop=209 --reports=all --output_path=./dlprof/xx  trtexec  --loadEngine=xx.trt

xx.trt is built from an ONNX file by TensorRT.

However, an error saying that the database is locked is reported for a random network:

[DLProf-01:26:26] DLprof completed system call successfully
[DLProf-01:27:59] Error Occurred:
[DLProf-01:27:59] database is locked

If possible, could you tell me the reason and how to avoid it?

Can you please check how many files/folders one dlprof call creates? dlprof uses default names for a range of output files/folders, and this error looks to me like a file from the previous call is still locked while another call is trying to write to it. Thanks.
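
One way to do that check is to run each dlprof call in its own working directory and list what appears there afterwards. The sketch below reuses only the flags from the commands above; the helper names and the per-run working directory are illustrative choices, and whether isolating the runs like this actually avoids the "database is locked" error is an assumption that has not been verified.

```python
import os
import subprocess


def snapshot(directory):
    """Return the relative paths of all files and folders under directory."""
    entries = set()
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            full = os.path.join(root, name)
            entries.add(os.path.relpath(full, directory))
    return entries


def dlprof_command(engine, output_path):
    """Build the same dlprof/trtexec command line used in the shell script."""
    return [
        "dlprof", "--mode=tensorrt", "--iter_start=200", "--iter_stop=209",
        "--reports=all", f"--output_path={output_path}",
        "trtexec", f"--loadEngine={engine}",
    ]


def run_isolated(engine, output_path, workdir):
    """Run one dlprof call in workdir and return what it created there."""
    os.makedirs(workdir, exist_ok=True)
    before = snapshot(workdir)
    # Absolute paths, because the command runs with workdir as its cwd.
    cmd = dlprof_command(os.path.abspath(engine), os.path.abspath(output_path))
    subprocess.run(cmd, cwd=workdir, check=True)  # blocks until dlprof exits
    return sorted(snapshot(workdir) - before)
```

For example, run_isolated("1.trt", "./dlprof/1", "./work/1") followed by run_isolated("2.trt", "./dlprof/2", "./work/2") runs the calls strictly one after another and shows which default-named files each call leaves behind, so any name shared between runs is easy to spot.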