Nsight Compute hanging issue

I am using Nsight Compute to analyze an LLM training job, but it hangs and fails to launch the workload.
The message is:
"==WARNING== Launching the workload is taking more time than expected. If this continues to hang, terminate the profile and re-try by profiling the range of all related launches using '--replay-mode range'"

Could you please provide me with some guidance? Thanks in advance!

Hi, @cfrancisy

Thanks for using the tool. You should be able to control the overhead by collecting fewer metrics or profiling fewer kernels. See the Kernel Profiling Guide in the Nsight Compute documentation.

Hi, Veraj.

Thanks for your reply.

I have reduced the number of metrics and am now using only InstructionStats, but it still doesn't work.

Hi, @cfrancisy

The behavior you are seeing is currently expected for mandatory concurrent kernels such as NCCL allreduce. It happens because kernel execution is serialized when profiling in kernel replay mode, so a kernel that has to run concurrently with others cannot make progress. Profiling NCCL is supported through the new app-range replay mode (see the Kernel Profiling Guide in the Nsight Compute documentation), available starting with NCU version 2023.1 (CUDA 12.1). The app-range replay mode profiles ranges without API capture by relaunching the entire application multiple times. After you set an appropriate range (using the profiler start/stop API or NVTX ranges), such applications can be profiled with --replay-mode app-range. This may require application code changes if you do not already have start/stop or NVTX calls at the appropriate points in the code.

Hi, @veraj

Thanks for your prompt reply.

I don't fully understand the app-range mode yet, so I will need to spend some time reading the document you shared.

I am using PyTorch FSDP for LLM training. According to your document, it seems that I need to add cu(da)ProfilerStart/Stop markers in the underlying CUDA code of PyTorch. Is that correct?

Yes. If you need to profile the range, you need to add the code to specify the range.
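
For anyone landing here later, the range markers can be sketched from the Python side. This is a minimal illustration, not a tested FSDP recipe: it assumes that `torch.cuda.profiler.start()` and `torch.cuda.profiler.stop()` (PyTorch's wrappers around cudaProfilerStart/Stop) are enough to delimit the range, so the underlying CUDA code would not need to be touched. The names `cuda_profiler_hooks`, `profiler_range`, `run_training`, and `profiled_step` are hypothetical helpers made up for this example.

```python
import contextlib


def cuda_profiler_hooks():
    """Return (start, stop) callables for the CUDA profiler.

    Falls back to no-ops when PyTorch or a CUDA device is unavailable,
    so the script still runs outside a profiling session.
    """
    try:
        import torch
    except ImportError:
        return (lambda: None), (lambda: None)
    if not torch.cuda.is_available():
        return (lambda: None), (lambda: None)
    # These wrap cudaProfilerStart / cudaProfilerStop.
    return torch.cuda.profiler.start, torch.cuda.profiler.stop


@contextlib.contextmanager
def profiler_range(start, stop):
    """Mark everything inside the `with` body as the range to profile."""
    start()
    try:
        yield
    finally:
        # Always close the range, even if the step raises.
        stop()


def run_training(step_fn, num_steps, profiled_step, hooks=None):
    """Run `num_steps` steps and wrap only `profiled_step` in a profiler range.

    `step_fn` stands in for one training iteration (forward, backward,
    optimizer step); it is a placeholder, not part of any library.
    """
    start, stop = hooks if hooks is not None else cuda_profiler_hooks()
    for step in range(num_steps):
        if step == profiled_step:
            with profiler_range(start, stop):
                step_fn(step)
        else:
            step_fn(step)


if __name__ == "__main__":
    # Placeholder step; a real job would run its FSDP iteration here.
    run_training(lambda step: None, num_steps=5, profiled_step=3)
```

The script would then be launched under ncu with --replay-mode app-range, as described above. Because that mode relaunches the whole application several times, it is probably safest to pick a fixed step index so the marked range covers the same work on every run.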

Hi, @veraj

Thanks! I will have a try and add some markers!
