Huge data file generated by Nsight

I used the following command, without specifying any metrics, to profile an application with Nsight:

/usr/local/cuda-10.1/NsightCompute-2019.1/nv-nsight-cu-cli -f -o profile.nsight …/…/…/build/bin/bfs market mygraph.mtx

I have profiled the same input with nvprof; its output file is small and the run finishes quickly. With Nsight, however, it is still profiling after 1 hour and the output file is already about 2GB!

Is that normal? I don’t think so…

Please note that in its default setup, nvprof doesn’t collect any kernel metrics; it only collects a trace of API and GPU events. The overhead for this is typically much smaller than for collecting metrics. Nsight Compute, on the other hand, is not intended for tracing the application, but for collecting metrics for targeted kernels. As a result, it collects a wide range of such metrics by default, which causes a relatively high overhead.

If you want the same behavior as nvprof in its default configuration, you should use our new tracing tool Nsight Systems: https://developer.nvidia.com/nsight-systems
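As a minimal sketch (assuming the nsys command-line tool from Nsight Systems is installed and on your PATH; the output file name is only an example), a trace comparable to the default nvprof run would look like:

nsys profile -o trace_report …/…/…/build/bin/bfs market mygraph.mtx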

You can try to speed up data collection in Nsight Compute by, for example, reducing the number of profiled kernels and the set of collected metrics. For instance, you could use

nv-nsight-cu-cli -c 1 --section SpeedOfLight

to collect only the SpeedOfLight metric section, and only for a single kernel launch. You can check the documentation for other configuration options.
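If you want to narrow it down further, you can also restrict profiling to specific kernels by name. As a sketch (the kernel regex is only a placeholder, and the exact option names may differ between versions, so please verify them with nv-nsight-cu-cli --help):

nv-nsight-cu-cli --kernel-regex "<kernel name regex>" -c 1 --section SpeedOfLight -f -o profile.nsight …/…/…/build/bin/bfs market mygraph.mtx

This limits the report to the first matching launch and a single section, which keeps both the overhead and the report size small.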

In the next version of the tool, the default set of metrics will be reduced to incur less overhead when nothing is configured, but it will still be significantly larger than with nvprof tracing.

Hi Felix,
I tried with the visual interface, and you can see in the pictures below that I have only selected the instruction mix analysis.

[External images: screenshots of the Nsight Compute section selection, with only the instruction mix analysis enabled]

When I launch the profiler, the output appears slowly, for example:

==PROF== Connected to process 38377
==PROF== Profiling "kernel_info" - 1: 0%....50%....100% - 3 passes
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
  1 by 1 by 1 MPI processor grid
Created 256000 atoms
  create_atoms CPU = 0.0724353 secs

--------------------------------------------------------------------------
- Using acceleration for lj/cut:
-  with 1 proc(s) per device.
--------------------------------------------------------------------------
Device 0: GeForce RTX 2080 Ti, 68 CUs, 10/11 GB, 1.5 GHZ (Single Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.

==PROF== Profiling "kernel_zero" - 2: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 3: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 4: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 5: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 6: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 7: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 8: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 9: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 10: 0%....50%....100% - 3 passes
==PROF== Profiling "kernel_zero" - 11: 0%....50%....100% - 3 passes
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.005
==PROF== Profiling "calc_cell_id" - 12: 0%....50%....100% - 3 passes
==PROF== Profiling "radixSortBlocks" - 13: 0%....50%....100% - 3 passes
==PROF== Profiling "findRadixOffsets" - 14: 0%....50%....100% - 3 passes
==PROF== Profiling "scan4" - 15: 0%....50%....100% - 3 passes
==PROF== Profiling "scan4" - 16: 0%....50%....100% - 3 passes
==PROF== Profiling "vectorAddUniform4" - 17: 0%....50%....100% - 3 passes
==PROF== Profiling "reorderData" - 18: 0%....50%....100% - 3 passes
==PROF== Profiling "radixSortBlocks" - 19: 0%....50%....100% - 3 passes
==PROF== Profiling "findRadixOffsets" - 20: 0%....50%....100% - 3 passes
==PROF== Profiling "scan4" - 21: 0%....50%....100% - 3 passes
==PROF== Profiling "scan4" - 22: 0%....50%....100% - 3 passes
==PROF== Profiling "vectorAddUniform4" - 23: 0%....50%....100% - 3 passes
==PROF== Profiling "reorderData" - 24: 0%....50%....100% - 3 passes
...
...

The file size is also growing quickly. After a few minutes, while the program is still near its beginning (I can see that the main printf statements haven’t been printed to the screen yet), the file size reaches 20MB.

With nvprof, the total file size for the whole duration of the program is less than 10MB.

With nvprof, the total file size for the whole duration of the program is less than 10MB.
You haven’t listed which exact nvprof command you were using, but I assume you didn’t collect instruction metrics and instead used the default (trace) mode? If so, I see no reason why the files should be the same or a similar size, since the two tools collect completely different data. That also explains the very different overhead between them.
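For reference, the two nvprof modes look roughly like this (the metric name and the ./app placeholder are only examples):

nvprof -o trace.nvvp ./app
nvprof --metrics achieved_occupancy -o metrics.nvvp ./app

The first only records a timeline of API calls and GPU activities; the second collects per-kernel hardware metrics and may replay kernels, which is much closer to what Nsight Compute does.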

I agree that we should look into reducing the file size required by Nsight Compute. However, I don’t think the current behavior blocks users from optimizing kernels, since collecting profile data for a large number of kernels is not the intended usage scenario of the tool, and is almost never necessary.

If you could let us know what you are trying to analyze or optimize, maybe we can get you some insight on how to achieve this most efficiently?

I ran these two commands:

nvprof --metrics inst_bit_convert,inst_compute_ld_st,inst_control,inst_fp_32,inst_integer,inst_inter_thread_communication,inst_misc -f -o imix.nvvp --log-file nvvp.log lmp_mpi -sf gpu -in in.lj

nvprof --metrics inst_fp_32,flop_count_sp,flop_count_sp_add,flop_count_sp_fma,flop_count_sp_mul,flop_count_sp_special -f -o fpmix.nvvp --log-file nvvp.log lmp_mpi -sf gpu -in in.lj

The first file is 7.8MB and the second is 6MB.

since they collect completely different data
I agree, since I don’t know what is being collected when I select the instruction analysis. Still, I think the output is huge.

Even when I only select GPU speed of light, the output is large.

Let me test the matrix multiplication example from the SDK, and I will come back.

@Felix,
Here is a new test.
I ran ./matrixMul -wA=4096 -hA=4096 -wB=4096 -hB=4096 with the following configurations.

Titan V, which is CC 7.0:
1- Using Nsight Compute with the GPU Speed Of Light section (see the CLI sketch after this list).
2- Using nvprof -o matrixmul.titanv.nvvp ./matrixMul -wA=4096 -hA=4096 -wB=4096 -hB=4096

2080 Ti, which is CC 7.5:
1- Using Nsight Compute with the GPU Speed Of Light section.
2- nvprof not applicable.
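For reference, what I selected in the UI should correspond roughly to the following CLI invocation (this is only an approximation, since I actually used the visual interface; the output name is arbitrary):

/usr/local/cuda-10.1/NsightCompute-2019.1/nv-nsight-cu-cli --section SpeedOfLight -f -o matrixmul.titanv ./matrixMul -wA=4096 -hA=4096 -wB=4096 -hB=4096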

I have attached three result files. As you can see, the nvprof output is about 400KB, while the Nsight Compute results are about 4MB.

Please download them from https://gofile.io/?c=jJbPnP

Also, if you have a look at the results, the memory utilization on the 2080 Ti is about 45%. However, according to nvidia-smi, the program only uses about 400MB out of 12GB of memory.

See the picture at https://pasteboard.co/ICy8TVz.png

The SOL Memory metric in the GPU Speed Of Light section does not indicate the amount of memory used by the application. Rather, it indicates the application’s average utilization of the GPU’s memory pipelines and resources, relative to their theoretical maximum. Typically this will be limited by properties such as DRAM or cache throughput, but many individual contributors factor into it.
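As a purely illustrative example (the numbers are made up, not taken from your report): if a kernel sustains roughly 280 GB/s of DRAM traffic on a GPU whose theoretical peak DRAM bandwidth is about 616 GB/s, the DRAM contribution to SOL Memory would be about 280 / 616 ≈ 45%, regardless of whether the application has allocated 400MB or 10GB of device memory.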

As I mentioned earlier, your usage of nvprof as

nvprof -o matrixmul.titanv.nvvp ./matrixMul -wA=4096 -hA=4096 -wB=4096 -hB=4096

creates a trace, while collecting the GPU Speed Of Light section with Nsight Compute collects very different data (per-kernel profiling metrics). We can look into reducing the size of the reports on disk, but there is no reason why the nvprof report should be bigger than, or the same size as, the Nsight Compute report.