I am using Nsight Compute to analyze an LLM training job, but it hangs and fails to launch the workload.
The error message is:
"==WARNING== Launching the workload is taking more time than expected. If this continues to hang, terminate the profile and re-try by profiling the range of all related launches using '--replay-mode range'"
The behavior you are seeing is currently expected for mandatory concurrent kernels such as NCCL allreduce. It happens because kernel execution is serialized when profiling in kernel replay mode. Profiling NCCL is supported through the new app-range replay mode (see the Kernel Profiling Guide in the Nsight Compute documentation), starting with NCU version 2023.1 (CUDA 12.1). App-range replay profiles ranges without API capture by relaunching the entire application multiple times. After you define an appropriate range (using the profiler start/stop API or NVTX ranges), such applications can be profiled with --replay-mode app-range. This may require application code changes if you do not already have start/stop or NVTX calls at the appropriate points in the code.
I probably need to spend some time reading the document you shared; I do not fully understand app-range replay yet.
I am using PyTorch FSDP for LLM training. According to the documentation, it seems that I need to add cu(da)ProfilerStart/Stop markers in PyTorch's underlying CUDA code. Is that correct?