Nsight compute hanging issue

cfrancisy · February 21, 2024, 6:42am

I am using Nsight Compute to analyze an LLM training job, but it hangs and fails to launch the workload.
The error message is:
"==WARNING== Launching the workload is taking more time than expected. If this continues to hang, terminate the profile and re-try by profiling the range of all related launches using '--replay-mode range'"

Could you please provide me with some guidance? Thanks in advance!

veraj · February 26, 2024, 5:11am

Hi, @cfrancisy

Thanks for using the tool. You should be able to control the overhead by using fewer metrics or analyzing fewer kernels. See Kernel Profiling Guide :: Nsight Compute Documentation

cfrancisy · February 26, 2024, 5:42am

Hi, Veraj.

Thanks for your reply.

I have reduced the number of metrics and am now using only InstructionStats, but it still doesn't work.

veraj · February 26, 2024, 6:51am

Hi, @cfrancisy

The behavior you are seeing is currently expected for mandatory concurrent kernels such as NCCL allreduce. It happens because kernel execution is serialized when profiling in kernel replay mode, and kernels that must run concurrently with their peers cannot complete when run one at a time.

Profiling NCCL is supported through the new app-range replay mode (Kernel Profiling Guide :: Nsight Compute Documentation), available starting with NCU version 2023.1 (CUDA 12.1). The app-range replay mode profiles ranges without API capture by relaunching the entire application multiple times. After setting an appropriate range (using the profiler start/stop API or NVTX ranges), such applications can be profiled with --replay-mode app-range. This may require application code changes if you do not already have start/stop API or NVTX calls at the appropriate points in the code.
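As a rough illustration of marking a range with the profiler start/stop API, here is a minimal Python sketch. It assumes a PyTorch training script; `torch.cuda.profiler.start()` and `torch.cuda.profiler.stop()` are PyTorch's wrappers around cudaProfilerStart/cudaProfilerStop. The helper names `profiler_range` and `train`, and the choice of which step to profile, are hypothetical and not taken from this thread.

```python
import contextlib

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None
    _HAS_CUDA = False


@contextlib.contextmanager
def profiler_range(enabled=True):
    """Wrap a region in cudaProfilerStart/cudaProfilerStop (no-op without CUDA)."""
    active = enabled and _HAS_CUDA
    if active:
        torch.cuda.synchronize()      # finish earlier work before the range opens
        torch.cuda.profiler.start()   # calls cudaProfilerStart
    try:
        yield active
    finally:
        if active:
            torch.cuda.synchronize()  # make sure the range's kernels have finished
            torch.cuda.profiler.stop()  # calls cudaProfilerStop


def train(step_fn, num_steps, profile_step):
    """Run num_steps training steps; only profile_step is inside the range."""
    for step in range(num_steps):
        with profiler_range(enabled=(step == profile_step)):
            step_fn(step)  # hypothetical: forward, backward, optimizer step
```

The script would then be launched under ncu with --replay-mode app-range as described above. Because app-range relaunches the whole application several times, the marked range should contain the same work on every run.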

cfrancisy · February 26, 2024, 7:18am

Hi, @veraj

Thanks for your prompt reply.

I may need to spend some time reading the document you shared; I do not fully understand app-range yet.

I am using PyTorch FSDP for LLM training. According to your document, it seems that I need to add cu(da)ProfilerStart/Stop markers in PyTorch's underlying CUDA code. Is that correct?

veraj · February 26, 2024, 7:44am

Yes. To profile a range, you need to add code that specifies it.
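For the NVTX alternative, a range can be sketched along the following lines. This assumes the markers are placed in the Python training script through `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`. Whether markers at this level are sufficient for a given FSDP job is an assumption to verify. The range name "train_step" and the helpers `nvtx_range` and `training_step` are hypothetical.

```python
import contextlib

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None
    _HAS_CUDA = False


@contextlib.contextmanager
def nvtx_range(name):
    """Push an NVTX range on entry and pop it on exit (no-op without CUDA)."""
    if _HAS_CUDA:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAS_CUDA:
            torch.cuda.nvtx.range_pop()


def training_step(step, forward_backward, optimizer_step):
    """One training step wrapped in a named NVTX range."""
    with nvtx_range("train_step"):      # hypothetical range name
        loss = forward_backward(step)   # hypothetical: forward + backward pass
        optimizer_step()                # hypothetical: optimizer update
    return loss
```

On the ncu side, NVTX ranges are typically selected with the --nvtx and --nvtx-include options together with --replay-mode app-range; the exact filter syntax should be checked against the Nsight Compute documentation.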

cfrancisy · February 26, 2024, 7:51am

Hi, @veraj

Thanks! I will give it a try and add some markers!

