I tried to profile NVIDIA/waveglow (https://github.com/NVIDIA/waveglow) with this command:
nv-nsight-cu-cli --export ./nsight_output ~/.virtualenvs/waveglow/bin/python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6
The Python command comes from the instructions in the NVIDIA/waveglow repository, and it works with Nsight Systems, but not with Nsight Compute.
The process hangs, printing this log:
...
==PROF== Profiling - 115: 0%....50%....100%
==PROF== Profiling - 116: 0%....50%....100%
==PROF== Profiling - 117: 0%....50%....100%
==PROF== Profiling - 118: 0%....50%....100%
==PROF== Profiling - 119: 0%....50%....100%
==PROF== Profiling - 120: 0%....50%....100%
==PROF== Profiling - 121: 0%....50%....100%
==PROF== Profiling - 122: 0%....50%....100%
==PROF== Profiling - 123: 0%....50%....100%
==PROF== Profiling - 124: 0%....50%....100%
==PROF== Profiling - 125: 0%....50%....100%
==PROF== Profiling - 126: 0%....50%....100%
==PROF== Profiling - 127: 0%....50%....100%
==PROF== Profiling - 128: 0%....50%....100%
==PROF== Profiling - 129: 0%....50%....100%
==PROF== Profiling - 130: 0%....50%....100%
==PROF== Profiling - 131: 0%....50%....100%
==PROF== Profiling - 132: 0%....50%....100%
==PROF== Profiling - 133: 0%....50%....100%
...
I tried the workaround from https://devtalk.nvidia.com/default/topic/1049875/nsight-compute-/illegal-memory-access-during-nsight-compute-profiling/ , but it did not work.
Can you please let us know the exact version of Nsight Compute you are using, e.g. via running:
nv-nsight-cu-cli --version
Also, what OS and GPU is this on?
Thanks
Local: Nsight Compute GUI 2019.3.1 (Build 26317742) on macOS Mojave.
Remote: Nsight Compute CLI 1.0 (Build 24827263) on CentOS Linux 7 (Core).
GPU: Tesla V100-PCIE-32GB
Sorry for the missing information.
You should always use the same version of Nsight Compute on both the host and target systems, i.e. you cannot mix 2019.3.1 and 1.0. As for the hang, I recommend trying 2019.3.1 first.
I updated Nsight Compute to 2019.3, and I got this log:
...
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 286: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 287: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 288: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 289: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 290: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 291: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 292: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 293: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 294: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 295: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 296: 0%....50%...^C
==PROF== Received signal, trying to shutdown target application
- 43 passes
==ERROR== Failed to profile kernel "weight_norm_fwd_first_dim_ker..." in process
==ERROR== An error occurred while trying to profile.
==ERROR== An error occurred while trying to profile
==PROF== Report: nsight_compute_result.nsight-cuprof-report
I pressed Ctrl+C because profiling hangs, and Nsight Compute profiles only one kernel. Is this a problem with PyTorch, or with Nsight Compute?
I pressed Ctrl+C because profiling hangs
Do you mean it remains at kernel launch 296 for a long time, or does it keep profiling kernels? You can limit the number of profiled kernels using e.g. the -c command line option.
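For example, to stop after the first 10 profiled kernel launches (10 is just an illustrative count), you could add
-c 10
to the nv-nsight-cu-cli options, in front of the python executable in your command.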
To mitigate any hang due to metric collection, you can try reducing the set of collected metrics by selecting only one or a few sections from the available list. Try e.g. passing
--section SpeedOfLight
on the command line to collect only high-level performance metrics.
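To see which sections are available on your installation, you can list them first; as far as I know the CLI has a listing option for this:
nv-nsight-cu-cli --list-sections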
And Nsight Compute profiles only one kernel
Do you have reason to believe that more kernels are launched by the application? You can use e.g. Nsight Systems to trace the application and check which kernels are launched.
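If you have Nsight Systems installed on the target, a minimal CUDA trace of your command could look something like this (the output name waveglow_trace is arbitrary):
nsys profile --trace=cuda -o waveglow_trace ~/.virtualenvs/waveglow/bin/python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6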
Tracing this with nvprof (or Nsight Systems), I see that this application launches 300 instances of the weight_norm_fwd_first_dim_kernel kernel. It is also the first kernel to be launched by the app, so you won’t see any other kernels being profiled until after the 300 instances. You will very likely just need to wait longer for it to profile other kernels, too. You can also tell Nsight Compute to e.g. skip the first 300 kernel launches using the -s option.
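For example, skipping those 300 launches and then profiling a limited number of kernels could look roughly like this (the -c value is only a suggestion; note that the -s before python3 is Nsight Compute's launch-skip option, while the -s 0.6 at the end belongs to inference.py):
nv-nsight-cu-cli -s 300 -c 20 --export ./nsight_output ~/.virtualenvs/waveglow/bin/python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6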
Not sure if this helps, but:
I had a similar hang issue with the Nsight Compute CLI while profiling a PyTorch ML training iteration. I left the profiler running for over a day and it was still stuck profiling the first kernel. I suspected this was because the application had almost exhausted the GPU memory, and Nsight also needed extra memory to store profiling data but did not throw an out-of-memory error. So I reduced the batch size of the training job slightly to lower the memory usage, and the profiling completed normally.
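If you want to check whether memory pressure is the issue in your case too, watching GPU memory while the profiler runs can help; a standard nvidia-smi query refreshed every second would be something like:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1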