Application Range Replay frozen

I am trying to use the app-range replay option as there are certain API calls that the range option errors out on.
However, the profiler seems to hang after a couple of ranges and stays there.

image
image

I’ve added a picture of the ncu options that I am using; I’ve confirmed that kernel and application replay works with these options, so why would app-range fail?

Thanks

Hi, @gohilshiv1
Sorry for the issue you met. But we can’t figure out the reason unless we have a repro.
Can you provide the repro ?

Hello,

I am using a forked version of the Megatron repo, linked here: Megatron-LM/megatron/training.py at testProfiles · sgohil3/Megatron-LM · GitHub

I am profiling the nvtx range starting at line 791 to line 900 for training.py

Thanks!

This is the profiling script that I am using if that helps: Megatron-LM/profilingScript.sh at testProfiles · sgohil3/Megatron-LM · GitHub

Thanks. We’ll try to reproduce internally.

Hi, @gohilshiv1

app-range replay requires deterministic program execution.

Wouldn’t that throw an error when it tries to run the second pass? This is freezing in the middle of the first pass