nv-nsight-cu-cli profiles every kernel 47x, is very slow

I’m having trouble profiling with nv-nsight-cu-cli. Run by itself, the TensorFlow program below takes well under a minute (perhaps ten seconds). This is how I invoke the profiler:

/usr/local/NVIDIA-Nsight-Compute/nv-nsight-cu-cli \
    -o mnist_softmax_deep_fp16_advanced.ns-cuprof-report \
    ~/edit/venv/bin/python mnist_softmax_deep_fp16_advanced.py

Under nv-nsight-cu-cli, it has been running for over an hour, and it’s unclear how far along it is. There is a lot of output of the form: ==PROF== Profiling "EigenMetaKernel" - 1120: 0%....50%....100% - 47 passes. This is a problem because the real program I need to profile normally takes 10 minutes to run.

What can I do to have it profile at something approaching real-time?

The code is from https://github.com/khcs/fp16-demo-tf/blob/master/mnist_softmax_deep_fp16_advanced.py .

What you are using there is the Nsight Compute CLI, which is intended for deep dives into individual kernels. If you are looking for high-level system profiling, you should be using Nsight Systems (and the nsys CLI).

You can download the latest version from https://developer.nvidia.com/nsight-systems. A new version was posted just yesterday (I have not even gotten around to posting an announcement in the forum yet).
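As a rough starting point for the script in the question, a basic nsys invocation might look like the sketch below. The report name is only an illustration, the interpreter and script paths are copied from the question, and the guard simply avoids a confusing error if nsys is not on your PATH yet.

```shell
# Assumed names: the report name is illustrative; the paths come from the question.
REPORT=mnist_softmax_deep_fp16_advanced
PYTHON="$HOME/edit/venv/bin/python"
SCRIPT=mnist_softmax_deep_fp16_advanced.py

if command -v nsys >/dev/null 2>&1; then
    # Trace the whole run once and write the report to $REPORT.
    nsys profile -o "$REPORT" "$PYTHON" "$SCRIPT"
else
    echo "nsys not found on PATH; install Nsight Systems first" >&2
fi
```

Because nsys traces the application as it runs instead of replaying each kernel many times, the overhead should be far lower than what you are seeing with Nsight Compute.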

Let me know if you need help getting started with the Nsight Systems CLI.
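If you later want a deep dive into one kernel with Nsight Compute, restricting it to a few launches keeps the run time manageable. The sketch below assumes your nv-nsight-cu-cli version accepts --kernel-regex and --launch-count (check --help, since option names have changed between releases). The kernel name is taken from the output in the question, and the launch count of 1 is an arbitrary choice.

```shell
# Paths and report name are copied from the question; the filter values are assumptions.
NCU=/usr/local/NVIDIA-Nsight-Compute/nv-nsight-cu-cli
KERNEL_REGEX=EigenMetaKernel   # kernel name seen in the ==PROF== output
LAUNCH_COUNT=1                 # stop after profiling this many matching launches

if [ -x "$NCU" ]; then
    # Profile only launches whose kernel name matches the regex.
    "$NCU" --kernel-regex "$KERNEL_REGEX" --launch-count "$LAUNCH_COUNT" \
        -o mnist_softmax_deep_fp16_advanced.ns-cuprof-report \
        "$HOME/edit/venv/bin/python" mnist_softmax_deep_fp16_advanced.py
else
    echo "nv-nsight-cu-cli not found at $NCU" >&2
fi
```

Every launch that does match is still replayed for all of its passes, so this reduces how many kernels are profiled, not the cost of profiling each one.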