I am profiling a multi-node Megatron-LM training run, but my nsys report only shows the CUDA HW timeline for a single GPU (one rank), even though the job runs across 2 servers with 8 GPUs in total (4 GPUs per node). The hardware metrics for the remaining 3 GPUs on that node are missing.
Environment and Parallelism
Framework: Megatron-LM (using pretrain_gpt.py script via torchrun).
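The launch on each node is roughly as follows (a simplified sketch: the actual model and data arguments are omitted, and the options other than the capture-range flags are only illustrative):

  # One nsys session per node wraps torchrun and its 4 local ranks.
  # <MASTER_IP>, the output name and the --trace list are placeholders;
  # --capture-range=cudaProfilerApi is shown because the script calls cudaProfilerStart/Stop.
  nsys profile \
    --trace=cuda,nvtx \
    --capture-range=cudaProfilerApi \
    --capture-range-end=stop \
    -o megatron_node0 \
    torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
      --master_addr=<MASTER_IP> --master_port=6000 \
      pretrain_gpt.py <model and data args ...>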
How should I modify the nsys profiling strategy to reliably capture CUDA HW metrics for all 4 GPUs on the local node (Node 1), or ideally, for all 8 GPUs across both nodes?
Hi @lunsheng231, if you expand the threads row for process 28791, are there any “CUDA API” rows under any of the threads? And if so, are there any “cu(da)LaunchKernel()” API calls on those rows?
Is there any warning/error in the diagnostic messages (upper right corner of the report window)?
Is it possible to share the report for us to take a look?
Is it possible to try the following and see if there’s any difference?
Change --capture-range-end=stop to --capture-range-end=none in your command
Add --session-new=my_profile_session in your command
After the collection is started, wait for about 15 seconds (i.e. the duration of the report you shared), and run nsys stop --session=my_profile_session to manually stop the collection.
Check the generated report. (A combined sketch of these commands follows below.)
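Putting those changes together, the sequence would look roughly like this (the output name and any options other than the ones above are placeholders):

  # Same launch as before; only --capture-range-end changes and a named session is added.
  nsys profile \
    --capture-range=cudaProfilerApi \
    --capture-range-end=none \
    --session-new=my_profile_session \
    -o megatron_node0 \
    torchrun --nproc_per_node=4 ... pretrain_gpt.py ...

  # ~15 seconds after the capture range has started, from another shell on the same node:
  nsys stop --session=my_profile_session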
The reason is that I suspect only process 28789, which called the cudaProfilerStop() API, had a chance to properly flush the buffers holding CUDA events. The other processes might not have had a chance to flush them, causing events to be lost (if that’s the case, it’s an issue in Nsys we need to look into). By setting --capture-range-end=none and manually calling nsys stop, we force the buffers in all processes to be flushed, so we can confirm or rule out this theory (and you can use it as a WAR if it works).
Under the same configuration for my Megatron-LM Llama3 pre-training, I can observe one entire training GlobalStep within the first 15 seconds. I stopped the command later, and the resulting profile is missing some events, even though these two GlobalSteps should ideally be identical.
But this did not seem to work, so I just stopped my training command.
The nsys stop --session=my_profile_session should stop the collection and generate a report, but the target app will keep running and needs to be manually terminated.
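For example (the pkill pattern here is just one way to clean up the leftover training processes; use whatever matches your job launcher):

  nsys stop --session=my_profile_session   # finalizes the collection and writes the report
  # The torchrun/python training processes keep running; terminate them separately, e.g.:
  pkill -f pretrain_gpt.py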
Under the same configuration for my Megatron-LM Llama3 pre-training, I can observe one entire training GlobalStep within the first 15 seconds.
The report confirms the theory - in capture range mode, only the process invoking cudaProfilerStop() has a chance to flush its buffers; the other processes are effectively terminated forcibly and will lose some events. We will need to look into it internally (I have opened ticket DTSP-21073 in our internal tracking system). Please manually call nsys stop as a WAR for now.
I stopped the command later, and the resulting profile is missing some events, even though these two GlobalSteps should ideally be identical.
Is it possible that process 56791 hadn’t really executed those kernels by the time you stopped?
Sorry for the late reply. Thanks for your help.
By manually stopping the profile session, I was able to successfully capture all 4 GPUs on my server.