I tried to profile NVIDIA/waveglow (https://github.com/NVIDIA/waveglow) with this command:
nv-nsight-cu-cli --export ./nsight_output ~/.virtualenvs/waveglow/bin/python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6
The Python command comes from the instructions in the NVIDIA/waveglow repository, and it works with Nsight Systems, but not with Nsight Compute.
The process hangs, printing this log:
...
==PROF== Profiling - 115: 0%....50%....100%
==PROF== Profiling - 116: 0%....50%....100%
==PROF== Profiling - 117: 0%....50%....100%
==PROF== Profiling - 118: 0%....50%....100%
==PROF== Profiling - 119: 0%....50%....100%
==PROF== Profiling - 120: 0%....50%....100%
==PROF== Profiling - 121: 0%....50%....100%
==PROF== Profiling - 122: 0%....50%....100%
==PROF== Profiling - 123: 0%....50%....100%
==PROF== Profiling - 124: 0%....50%....100%
==PROF== Profiling - 125: 0%....50%....100%
==PROF== Profiling - 126: 0%....50%....100%
==PROF== Profiling - 127: 0%....50%....100%
==PROF== Profiling - 128: 0%....50%....100%
==PROF== Profiling - 129: 0%....50%....100%
==PROF== Profiling - 130: 0%....50%....100%
==PROF== Profiling - 131: 0%....50%....100%
==PROF== Profiling - 132: 0%....50%....100%
==PROF== Profiling - 133: 0%....50%....100%
...
I tried the workaround from https://devtalk.nvidia.com/default/topic/1049875/nsight-compute-/illegal-memory-access-during-nsight-compute-profiling/ , but it did not work.
Can you please let us know the exact version of Nsight Compute you are using, e.g. via running:
nv-nsight-cu-cli --version
Also, what OS and GPU is this on?
Thanks
Local: Nsight Compute GUI 2019.3.1 (Build 26317742) on macOS Mojave.
Remote: Nsight Compute CLI 1.0 (Build 24827263) on CentOS Linux 7 (Core).
GPU: Tesla V100-PCIE-32GB
Sorry for the missing information.
You should always use the same version of Nsight Compute on both the host and target systems, i.e. you cannot mix 2019.3.1 and 1.0. As for the hang, I recommend trying 2019.3.1 first.
I updated Nsight Compute to 2019.3, and I got this log:
...
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 286: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 287: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 288: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 289: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 290: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 291: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 292: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 293: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 294: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 295: 0%....50%....100% - 48 passes
==PROF== Profiling "weight_norm_fwd_first_dim_ker..." - 296: 0%....50%...^C
==PROF== Received signal, trying to shutdown target application
- 43 passes
==ERROR== Failed to profile kernel "weight_norm_fwd_first_dim_ker..." in process
==ERROR== An error occurred while trying to profile.
==ERROR== An error occurred while trying to profile
==PROF== Report: nsight_compute_result.nsight-cuprof-report
I pressed Ctrl+C because profiling hangs, and Nsight Compute profiles only one kernel. Is this a problem with PyTorch, or with Nsight Compute?
I pressed Ctrl+C because profiling hangs
Do you mean it remains at kernel launch 296 for a long time, or does it keep profiling kernels? You can limit the number of profiled kernels using e.g. the -c command line option.
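For example, to stop after the first 10 profiled kernel launches (10 is just an illustrative count), you could add
-c 10
to the nv-nsight-cu-cli options, in front of the python executable in your command.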
To mitigate any hang due to metric collection, you can try reducing the set of collected metrics by selecting only one or a few sections from the available list. Try e.g. passing
--section SpeedOfLight
on the command line to collect only high-level performance metrics.
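To see which sections are available on your installation, you can list them first; as far as I know the CLI has a listing option for this:
nv-nsight-cu-cli --list-sections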
And Nsight Compute profiles only one kernel
Do you have reason to believe that more kernels are launched by the application? You can use e.g. Nsight Systems to trace the application and check which kernels are launched.
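If you have Nsight Systems installed on the target, a minimal CUDA trace of your command could look something like this (the output name waveglow_trace is arbitrary):
nsys profile --trace=cuda -o waveglow_trace ~/.virtualenvs/waveglow/bin/python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6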
Tracing this with nvprof (or Nsight Systems), I see that this application launches 300 instances of the weight_norm_fwd_first_dim_kernel kernel. It is also the first kernel to be launched by the app, so you won’t see any other kernels being profiled until after the 300 instances. You will very likely just need to wait longer for it to profile other kernels, too. You can also tell Nsight Compute to e.g. skip the first 300 kernel launches using the -s option.
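For example, skipping those 300 launches and then profiling a limited number of kernels could look roughly like this (the -c value is only a suggestion; note that the -s before python3 is Nsight Compute's launch-skip option, while the -s 0.6 at the end belongs to inference.py):
nv-nsight-cu-cli -s 300 -c 20 --export ./nsight_output ~/.virtualenvs/waveglow/bin/python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6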
Not sure if this helps, but:
I had a similar hang issue with the Nsight Compute CLI while profiling a PyTorch ML training iteration. I left the profiler running for over a day and it was still stuck profiling the first kernel. I suspected this was because the application had almost exhausted the GPU memory, and Nsight also needed extra memory to store profiling data but did not throw an out-of-memory error. So I reduced the batch size of the training job slightly to lower the memory usage, and the profiling completed normally.
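If you want to check whether memory pressure is the issue in your case too, watching GPU memory while the profiler runs can help; a standard nvidia-smi query refreshed every second would be something like:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1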