Hi,
Not sure if this is the right category, as dlprof is basically a layer on top of Nsight, but that’s the closest I could find.
I have been following the online tutorials to benchmark my PyTorch training and view the results in TensorBoard, as described in the DLProf User Guide (NVIDIA Deep Learning Frameworks Documentation). Right off the bat, I want to respectfully say that the documentation is not adequate. The script fails for seemingly random reasons, there are no resources for troubleshooting, and the overall experience is not great. The absence of a GitHub repository where users can share questions and solutions (they are instead relegated to this locked forum) also makes for a poor onboarding experience.
- dlprof fails when running for more than 10 iterations on my model. The error message is:
Processing events…
Saving temporary “/tmp/nsys-report-74f4-8509-41cc-327a.qdstrm” file to disk…
Creating final output files…
Processing [==============================================================100%]
Saved report file to “/tmp/nsys-report-74f4-8509-41cc-327a.qdrep”
Exporting 4758702 events: [===============================================100%]
Exported successfully to /tmp/nsys-report-74f4-8509-41cc-327a.sqlite
Report file moved to “/home/ubuntu/./nsys_profile.qdrep”
Report file moved to “/home/ubuntu/./nsys_profile.sqlite”
[DLProf-18:34:03] DLprof completed system call successfully
[DLProf-18:34:05] Initializing Nsight Systems database
[DLProf-18:34:20] Reading System Information from Nsight Systems database
[DLProf-18:34:20] Reading Domains from Nsight Systems database
[DLProf-18:34:23] Reading Ops from Nsight Systems database
[DLProf-18:34:40] Reading CUDA API calls from Nsight Systems database
[DLProf-18:57:23] Error Occurred:
[DLProf-18:57:23] Unable to find time_range {} = 0 in sequence infos for pid=24074
- dlprof, when it completes successfully with fewer than 10 iterations, does not generate any event files in the event_files/ directory, contrary to what the documentation says. An SQLite database file is created, but TensorBoard does not read it. I know that the output is correct because the .csv files are generated and contain the information that would be displayed in TensorBoard. The conversion from those to the event files seems to be what is silently failing.
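A minimal sketch, in Python, of how the symptom above could be confirmed by listing what a run directory actually contains. Only the `event_files/` directory name comes from the description above; the function name `summarize_outputs` and the assumption that the .csv and .sqlite outputs sit at the top level of the run directory are illustrative and may not match a given DLProf setup.

```python
from pathlib import Path


def summarize_outputs(run_dir):
    """Return the names of output files found under run_dir, grouped by kind.

    Assumed layout (hypothetical): CSV reports and the SQLite database at the
    top level of run_dir, TensorBoard event files under run_dir/event_files/.
    """
    run_dir = Path(run_dir)
    event_dir = run_dir / "event_files"

    # Collect every regular file below event_files/, if the directory exists.
    if event_dir.is_dir():
        events = sorted(p.name for p in event_dir.rglob("*") if p.is_file())
    else:
        events = []

    return {
        "event_files": events,
        "csv": sorted(p.name for p in run_dir.glob("*.csv")),
        "sqlite": sorted(p.name for p in run_dir.glob("*.sqlite")),
    }


if __name__ == "__main__":
    import sys

    # Default to the current directory when no path is given.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for kind, names in summarize_outputs(target).items():
        print(f"{kind}: {len(names)} file(s) {names}")
```

An empty `event_files` list alongside non-empty `csv` and `sqlite` lists would match the behaviour described: the reports are written, but nothing is produced for TensorBoard to load.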
My system runs Ubuntu 18.04 with a V100 and CUDA 11.
Thank you for your time.
Edouard