ncu -i yolo_infer_bs_32_epoch_20.ncu-rep --csv --page raw
I get the following error:
[libprotobuf ERROR /dvs/p4/build/sw/devtools/Agora/Rel/DTC_F/Imports/Source/ProtoBuf/protobuf-3_21_1/src/google/protobuf/message_lite.cc:133] Can't parse message of type "NV.Profiler.Messages.ProfileResult" because it is missing required fields: RuleResults[0].Body.Items[0].Message.Type
==ERROR== Failed to load report file 'yolo_infer_bs_64_epoch_20.ncu-rep'.
I also tried opening it in the same version of Nsight Compute. I tried changing the extension to .txt and uploading it here, but the YOLO report files are very large and exceed the maximum upload size.
There are other problems with YOLO that I have raised on the forum but not yet resolved. These are:
YOLO profiling takes too long: several hours for both inference and training, whereas even larger models such as BERT are profiled in less time.
The YOLO report files are very large, as a result of this long profiling time.
Can you reproduce this issue with a more minimal example/app as well?
Looking at the error, I assume the report file got corrupted, possibly because profiling was interrupted or because of the sheer size of the report when persisting it to disk.
YOLO profiling takes too long
YOLO files are very large in size
Nsight Compute is a kernel-level profiler that should be used to profile individual kernels and understand their optimization potential. To understand the overall performance of your application, start with Nsight Systems instead. Nsight Systems can help you identify which kernel(s) (if any) you should focus on to improve the performance of your application. Once identified, you can profile only these by using the various filtering options of NCU. See here for a large set of filter option examples.
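For example (the kernel-name pattern, output name, and script name below are placeholders, not something I know about your setup), you could restrict profiling to a handful of launches of the kernels you actually care about:

ncu --kernel-name regex:conv --launch-skip 10 --launch-count 5 -o yolo_conv_kernels python yolo_infer.py

This profiles only kernels whose name matches "conv", skips the first 10 matching launches to get past warm-up, and then profiles the next 5, instead of profiling every kernel in the run.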
If you run NCU to profile possibly hundreds or thousands of kernels within a complex application, it will not only take a very long time but also produce very large reports. And even if you wait long enough to collect them, they won't be useful to you, because the UI is not designed to let you comprehend data from thousands of kernels simultaneously (not to mention the reduced usability from slower response times of the UI elements).
Apart from (or along with) filtering, there are two things you can do to decrease profiling time and report sizes; both are illustrated in the example command after this list:
include fewer metrics (instead of using --set roofline, specify metrics with --metrics individually; also see the Metrics Reference)
disable rules with --apply-rules no
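For example (the metric names below are common device-level duration/throughput metrics; pick the ones relevant to your analysis from the Metrics Reference, as availability can depend on the GPU architecture):

ncu --metrics gpu__time_duration.sum,sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed --apply-rules no -o yolo_slim python yolo_infer.py

Collecting only a short metric list and skipping rule evaluation reduces both the per-kernel replay overhead and the size of the resulting report compared to a full section set.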
You should also consider using range replay or application range replay if you're ultimately interested in overall hardware unit utilization rather than individual kernels. To do this, you will need to add NVTX ranges to your application; see this guide.
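As a minimal sketch of what that could look like in a PyTorch-based YOLO script (the range name "inference", the model/batch variables, and the script name are assumptions on my side):

import torch
import torch.cuda.nvtx as nvtx

# ... load the model and prepare an input batch here (placeholders) ...

nvtx.range_push("inference")   # open an NVTX range around the region of interest
with torch.no_grad():
    output = model(batch)      # kernels launched here fall inside the range
torch.cuda.synchronize()       # make sure the GPU work finishes before closing the range
nvtx.range_pop()

You can then profile just that range and replay it as a whole instead of kernel by kernel (check the NVTX Filtering section of the documentation for the exact range-expression syntax):

ncu --nvtx --nvtx-include "inference" --replay-mode range -o yolo_inference_range python yolo_infer.py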