Profiler stuck while profiling a range

Hi,
While working with NVTX ranges in PyTorch code running on a multi-GPU system, we see that the profiler hangs in some cases. The part of the code where we define the ranges is:

        torch.cuda.nvtx.range_push("step_D1_1")
        # Forward pass real batch through D
        output = netD(real_cpu).view(-1)
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("step_D1_2")
        # Calculate loss on all-real batch
        errD_real = criterion(output, label)
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("step_D1_3")
        # Calculate gradients for D in backward pass
        errD_real.backward()
        D_x = output.mean().item()
        torch.cuda.nvtx.range_pop()
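
As a side note, the push/pop pairs can be wrapped in a small context manager so that every range is closed even if the body raises. This is only a sketch and is not expected to change the hang; `nvtx_range` is a hypothetical helper name, and the push/pop callables are passed in explicitly (in the script they would be `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`). Recent PyTorch releases also appear to provide `torch.cuda.nvtx.range` for the same purpose.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(name, push, pop):
    """Open a push/pop range called `name` and always close it again.

    `nvtx_range` is a hypothetical helper, not part of PyTorch.
    `push` takes the range name; `pop` takes no arguments.
    """
    push(name)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so ranges stay balanced.
        pop()


# Intended use in the training loop (assumes torch is imported):
#
#   with nvtx_range("step_D1_1",
#                   torch.cuda.nvtx.range_push,
#                   torch.cuda.nvtx.range_pop):
#       output = netD(real_cpu).view(-1)
```

The range names and the `--nvtx-include "step_D1_1/"` filters stay the same with this wrapper, since it issues the same push and pop calls.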

We ran the following commands, each collecting the gpu__cycles_elapsed metric for one of the ranges:

$ncu_path -c 200 --nvtx --nvtx-include "step_D1_1/" \
--metrics gpu__cycles_elapsed \
--replay-mode app-range --cache-control none --target-processes all -f -o test1 python3 main.py

$ncu_path -c 200 --nvtx --nvtx-include "step_D1_2/" \
--metrics gpu__cycles_elapsed \
--replay-mode app-range --cache-control none --target-processes all -f -o test2 python3 main.py

$ncu_path -c 200 --nvtx --nvtx-include "step_D1_3/" \
--metrics gpu__cycles_elapsed \
--replay-mode app-range --cache-control none --target-processes all -f -o test3 python3 main.py
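
The three invocations differ only in the range name and the output file, so they can also be generated from one place. A minimal sketch, assuming `ncu` is on the PATH (we use `$ncu_path` above); `build_ncu_command` is a hypothetical helper, and the flags are exactly the ones from the commands above.

```python
import shlex


def build_ncu_command(ncu_path, range_name, output_name, launch_count=200):
    """Return the argv list for one profiling run (hypothetical helper)."""
    return [
        ncu_path,
        "-c", str(launch_count),
        "--nvtx",
        "--nvtx-include", f"{range_name}/",  # trailing slash as in the commands above
        "--metrics", "gpu__cycles_elapsed",
        "--replay-mode", "app-range",
        "--cache-control", "none",
        "--target-processes", "all",
        "-f",
        "-o", output_name,
        "python3", "main.py",
    ]


if __name__ == "__main__":
    # Print the three command lines; pass the list to subprocess.run to execute.
    for i, name in enumerate(["step_D1_1", "step_D1_2", "step_D1_3"], start=1):
        print(shlex.join(build_ncu_command("ncu", name, f"test{i}")))
```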

Only the second command (range step_D1_2) finishes successfully:

Starting Training Loop...
==PROF== Profiling "range" - 0 (1/200): Application replay pass 1
[0/1][0/99]	Loss_D: 1.8108	Loss_G: 2.3480	D(x): 0.3327	D(G(z)): 0.3594 / 0.1407
==PROF== Profiling "range" - 1 (2/200): Application replay pass 1
[0/1][1/99]	Loss_D: 3.7181	Loss_G: 2.7408	D(x): 0.9820	D(G(z)): 0.9565 / 0.1019
==PROF== Profiling "range" - 2 (3/200): Application replay pass 1
[0/1][2/99]	Loss_D: 2.4518	Loss_G: 5.1990	D(x): 0.9641	D(G(z)): 0.8667 / 0.0096
==PROF== Profiling "range" - 3 (4/200): Application replay pass 1
[0/1][3/99]	Loss_D: 0.7467	Loss_G: 6.7249	D(x): 0.9246	D(G(z)): 0.4012 / 0.0024
...

But the other two commands, for ranges step_D1_1 and step_D1_3, get stuck at the following output:

Starting Training Loop...
==PROF== Profiling "range" - 0 (1/200): Application replay pass 1
==PROF== Profiling "range" - 1 (2/200): Application replay pass 1

We don’t see any progress after that.
The system has two 4090 GPUs and the profiler version is 2023.2.

Any idea on how to debug further?

Hi, @mahmood.nt

Thanks for reporting this! We can reproduce it internally and will let you know when there is an update.