Hi,
While working with NVTX ranges in PyTorch code running on a multi-GPU system, we see that the profiler hangs in some cases. This is the part of the code where we have defined the ranges:
torch.cuda.nvtx.range_push("step_D1_1")
# Forward pass real batch through D
output = netD(real_cpu).view(-1)
torch.cuda.nvtx.range_pop()
torch.cuda.nvtx.range_push("step_D1_2")
# Calculate loss on all-real batch
errD_real = criterion(output, label)
torch.cuda.nvtx.range_pop()
torch.cuda.nvtx.range_push("step_D1_3")
# Calculate gradients for D in backward pass
errD_real.backward()
D_x = output.mean().item()
torch.cuda.nvtx.range_pop()
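For reference, the same push/pop pairs can also be written with a small context manager so that every range is popped even if the body raises. This is only a sketch: the helper name nvtx_range is made up for illustration, and it takes the NVTX module as an argument (torch.cuda.nvtx in our case). It does not change what is profiled and is not a fix for the hang.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(name, nvtx):
    """Push an NVTX range on entry and always pop it on exit.

    `nvtx` is any object exposing range_push/range_pop,
    for example the torch.cuda.nvtx module.
    """
    nvtx.range_push(name)
    try:
        yield
    finally:
        # Runs even if the body raises, so pushes and pops stay balanced.
        nvtx.range_pop()


# Intended usage inside the training loop (needs a CUDA build of PyTorch):
#   import torch
#   with nvtx_range("step_D1_1", torch.cuda.nvtx):
#       output = netD(real_cpu).view(-1)
```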
We tried the following commands, collecting the gpu__cycles_elapsed metric:
$ncu_path -c 200 --nvtx --nvtx-include "step_D1_1/" \
--metrics gpu__cycles_elapsed \
--replay-mode app-range --cache-control none --target-processes all -f -o test1 python3 main.py
$ncu_path -c 200 --nvtx --nvtx-include "step_D1_2/" \
--metrics gpu__cycles_elapsed \
--replay-mode app-range --cache-control none --target-processes all -f -o test2 python3 main.py
$ncu_path -c 200 --nvtx --nvtx-include "step_D1_3/" \
--metrics gpu__cycles_elapsed \
--replay-mode app-range --cache-control none --target-processes all -f -o test3 python3 main.py
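The three commands differ only in the range name and the output file. A small helper like the one below can generate them so the flags stay identical across runs. It is a sketch: the function name build_ncu_command is made up for illustration, "ncu" stands in for whatever $ncu_path expands to on our system, and all flags are copied from the commands above.

```python
import shlex


def build_ncu_command(ncu_path, range_name, output_name, launch_count=200):
    """Return the ncu argument list for profiling one NVTX range."""
    return [
        ncu_path,
        "-c", str(launch_count),
        "--nvtx",
        "--nvtx-include", f"{range_name}/",
        "--metrics", "gpu__cycles_elapsed",
        "--replay-mode", "app-range",
        "--cache-control", "none",
        "--target-processes", "all",
        "-f",
        "-o", output_name,
        "python3", "main.py",
    ]


if __name__ == "__main__":
    # "ncu" is a placeholder for the real profiler path ($ncu_path).
    ranges = ["step_D1_1", "step_D1_2", "step_D1_3"]
    for index, name in enumerate(ranges, start=1):
        print(shlex.join(build_ncu_command("ncu", name, f"test{index}")))
```

The argument list can be passed directly to subprocess.run if the runs should be launched from Python instead of pasted into a shell.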
Only the second command (range step_D1_2) finishes successfully:
Starting Training Loop...
==PROF== Profiling "range" - 0 (1/200): Application replay pass 1
[0/1][0/99] Loss_D: 1.8108 Loss_G: 2.3480 D(x): 0.3327 D(G(z)): 0.3594 / 0.1407
==PROF== Profiling "range" - 1 (2/200): Application replay pass 1
[0/1][1/99] Loss_D: 3.7181 Loss_G: 2.7408 D(x): 0.9820 D(G(z)): 0.9565 / 0.1019
==PROF== Profiling "range" - 2 (3/200): Application replay pass 1
[0/1][2/99] Loss_D: 2.4518 Loss_G: 5.1990 D(x): 0.9641 D(G(z)): 0.8667 / 0.0096
==PROF== Profiling "range" - 3 (4/200): Application replay pass 1
[0/1][3/99] Loss_D: 0.7467 Loss_G: 6.7249 D(x): 0.9246 D(G(z)): 0.4012 / 0.0024
...
But the other two commands, for ranges step_D1_1 and step_D1_3, get stuck at the following output:
Starting Training Loop...
==PROF== Profiling "range" - 0 (1/200): Application replay pass 1
==PROF== Profiling "range" - 1 (2/200): Application replay pass 1
We don’t see any progress after that.
The system has two 4090 GPUs and the profiler version is 2023.2.
Any ideas on how to debug this further?