I have profiled the same input with nvprof, and the output file is small and the runtime is quick. However, Nsight Compute is still profiling after 1 hour, and its output file is already about 2GB!
Please note that in the default setup, nvprof doesn’t profile any kernel metrics; it only collects a trace of API and GPU events. The overhead for this is typically much smaller than for collecting metrics. Nsight Compute, on the other hand, is not intended for tracing the application, but for collecting metrics for targeted kernels. As a result, it collects a wide range of such metrics by default, which causes a relatively high overhead.
You can try to speed up data collection in Nsight Compute by, for example, reducing the number of profiled kernels and the set of collected metrics. E.g. you could use
nv-nsight-cu-cli -c 1 --section SpeedOfLight
to collect only the SpeedOfLight metric section, and only for one kernel launch. You can check the documentation for other configuration options.
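Filtering can be narrowed further by combining section selection with a kernel-name filter and a report export. A sketch of such an invocation (the `-k` and `-o` flags and the application name `myApp` are assumptions; verify the exact option names with `nv-nsight-cu-cli --help` for your version):

```shell
# Profile only the first launch of kernels whose name matches "matrixMul"
# (-k, assumed flag), collect only the Speed Of Light section, and export
# the report to a file (-o, assumed flag) instead of printing to stdout.
nv-nsight-cu-cli -k matrixMul -c 1 --section SpeedOfLight -o report ./myApp
```

Restricting both the kernel set and the section list is what reduces the profiling time and the size of the resulting report.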
In the next version of the tool, the default set of metrics will be reduced to add less overhead when not configuring anything, but it will still be significantly larger than with nvprof tracing.
The file size is also growing quickly. After some minutes, while the program is still near its beginning (I can see the main printf statements haven’t printed on the screen yet), the file size has already reached 20MB.
With nvprof, the total file size for the whole duration of the program is less than 10MB.
You haven’t listed which exact nvprof command you were using, but I assume you didn’t collect instruction metrics and instead used the default (trace) mode? If so, I see no reason why the files should be the same or a similar size, since the tools collect completely different data. That also explains the very different overhead between them.
I agree that we should look into reducing the file size required by Nsight Compute. However, I don’t think the current behavior is blocking users from optimizing kernels, since collecting profile data for a large number of kernels is not the intended usage scenario of the tool, and is almost never necessary.
If you could let us know what you are trying to analyze or optimize, maybe we can get you some insight on how to achieve this most efficiently?
since they collect completely different data
I agree, since I don’t know what is being collected when I enable the instruction analysis. Still, I think the output is huge.
Even when I only select the GPU Speed Of Light section, the output is large.
Let me test the matrix multiplication example from the SDK and I will come back.
@Felix,
Here is a new test.
I ran ./matrixMul -wA=4096 -hA=4096 -wB=4096 -hB=4096 with the following configurations.
TitanV, which is CC 7.0:
1- Using Nsight Compute with the GPU Speed Of Light section.
2- Using nvprof -o matrixmul.titanv.nvvp ./matrixMul -wA=4096 -hA=4096 -wB=4096 -hB=4096
2080Ti, which is CC 7.5:
1- Using Nsight Compute with the GPU Speed Of Light section.
2- nvprof not applicable.
I have attached three result files. As you can see, the nvprof file is about 400KB, while the Nsight Compute results are about 4MB.
Also, if you have a look at the results, the memory utilization on the 2080Ti is about 45%. However, according to nvidia-smi, the program uses only about 400MB out of 12GB of memory.
The SOL Memory metric in the GPU Speed Of Light section does not indicate the amount of memory used by the application. Rather, it indicates the application’s average utilization of the GPU’s memory pipelines and resources, with respect to their theoretical maximum. Typically, this is limited by properties such as DRAM or cache throughput, but many individual factors contribute to it.
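To make the distinction concrete, here is a small numeric sketch (the bandwidth figures are made-up assumptions, not measured values): if a kernel sustains roughly 277 GB/s of DRAM traffic on a GPU whose theoretical peak is 616 GB/s, the SOL-style utilization comes out near 45%, regardless of whether the application has allocated 400MB or many gigabytes.

```shell
# Hypothetical numbers: utilization relates achieved throughput to the
# theoretical peak, not to the amount of memory allocated.
peak_gbps=616      # assumed theoretical DRAM bandwidth of the GPU
achieved_gbps=277  # assumed throughput sustained by the kernel
awk -v a="$achieved_gbps" -v p="$peak_gbps" \
    'BEGIN { printf "SOL Memory ~ %.1f%%\n", 100 * a / p }'
# prints: SOL Memory ~ 45.0%
```

So a kernel with a small memory footprint can still show high (or low) SOL Memory; the metric reflects how hard the memory subsystem is being driven, not how full it is.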
The nvprof command creates a trace, while collecting the GPU Speed Of Light section with Nsight Compute collects very different data (per-kernel profiling metrics). We can look into reducing the size of the reports on disk, but there is no reason to expect the nvprof report to be the same size as, or bigger than, the Nsight Compute report.