Question when Profiling Megatron-LM

I am profiling a multi-node Megatron-LM training run, but my nsys report only shows the CUDA HW timeline for a single GPU (one rank), even though the job runs across 2 servers with 8 GPUs in total (4 GPUs per node). The hardware metrics for the remaining 3 GPUs on the local node are missing.

Environment and Parallelism

nsys profile -s none -t nvtx,cuda,osrt,cudnn --nic-metrics true -o ./profile/nsight_report_TP4PP2DP2NIC_node1 --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop bash ./examples/llama/train_llama3_8b_2nodes_fp16_144_145.sh

How should I modify the nsys profiling strategy to reliably capture CUDA HW metrics for all 4 GPUs on the local node (Node 1), or ideally, for all 8 GPUs across both nodes?
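(For reference, the same command is presumably launched separately on Node 2; the sketch below assumes only the output name differs between the two nodes:)

nsys profile -s none -t nvtx,cuda,osrt,cudnn --nic-metrics true -o ./profile/nsight_report_TP4PP2DP2NIC_node2 --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop bash ./examples/llama/train_llama3_8b_2nodes_fp16_144_145.sh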

@pkovalenko to respond.


Hi @lunsheng231, if you expand the threads row for process 28791, are there any “CUDA API” rows under any of the threads? And if so, are there any “cu(da)LaunchKernel()” API calls on the row?

Is there any warning/error in the diagnostic messages (upper right corner of the report window)?

Is it possible to share the report for us to take a look?

Thank you for the suggestion. I expanded the threads row for process 28791 and found no CUDA API rows traced under this process.

The Diagnostics Summary section contained two critical warnings:

  • “Not all NVTX events might have been collected.”

  • “CUDA profiling might have not been started correctly.”

Is there something wrong with my nsys profile command?

The file I am analyzing is attached below.

nsight_report_TP2PP4DP1NIC_node1.nsys-rep.zip (5.9 MB)

Thank you. Based on the OSRT backtraces, I can see there should have been CUDA events from other processes that are not currently showing any.

Is it possible to try the following and see if there’s any difference?

  1. Change --capture-range-end=stop to --capture-range-end=none in your command
  2. Add --session-new=my_profile_session to your command
  3. After the collection is started, wait for about 15 seconds (i.e. the duration of the report you shared), and run nsys stop --session=my_profile_session to manually stop the collection.
  4. Check the generated report

The reason is that I suspect only process 28789, which called the cudaProfilerStop() API, had a chance to properly flush the buffers holding CUDA events. The other processes might not have had a chance to flush theirs, causing events to be lost (if that’s the case, it’s an issue in Nsys we need to look into). By setting --capture-range-end=none and manually calling nsys stop, we can force the buffers in all processes to be flushed, so we can confirm or rule out this theory (and you can use it as a WAR if it works).
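Putting steps 1-3 together, a minimal sketch of the modified launch, based on the command from your first post (only the two capture-range-related options change):

# Terminal 1 on the node being profiled:
nsys profile -s none -t nvtx,cuda,osrt,cudnn --nic-metrics true -o ./profile/nsight_report_TP4PP2DP2NIC_node1 --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=none --session-new=my_profile_session bash ./examples/llama/train_llama3_8b_2nodes_fp16_144_145.sh

# Terminal 2 on the same node, about 15 seconds after the capture range opens:
nsys stop --session=my_profile_session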

Thank you. I tried again with my original nsys profile command as follows:

nsys profile -s none -t nvtx,cuda,osrt --nic-metrics true  -o ./profile/nsight_report_TP2PP4DP1NIC_node1 --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop bash ./examples/llama/train_llama3_8b_h100_2nodes_144_145.sh

This time I could see 2 CUDA HW rows while using 4 GPUs on a single server, so the issue of incomplete GPU coverage still persists.

Then, as you suggested, I tried to manually stop the profile session:

nsys profile -s none -t nvtx,cuda,osrt --nic-metrics true -o ./profile/nsight_report_TP2PP4DP1NIC_node1 --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=none --session-new=my_profile_session bash ./examples/llama/train_llama3_8b_h100_2nodes_144_145.sh

And in another terminal, I attempted to stop the profiling using:

nsys stop --session=my_profile_session

But this did not seem to work, so I just stopped my training command. Then I got my profiling output.

Under the same configuration for my Megatron-LM Llama3 pre-training, I can observe one entire training GlobalStep within the first 15 seconds. I stopped the command later, and the resulting profile is missing some events, given that these two GlobalSteps should ideally be identical.

But this did not seem to work, so I just stopped my training command.

The nsys stop --session=my_profile_session should stop the collection and generate a report, but the target app will keep running and needs to be manually terminated.
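For example, a sketch assuming the profile is launched in the background from an interactive bash shell (so $! identifies the job's process-group leader); adapt the termination step to however you normally stop the training:

# Launch the profile in the background and remember the job's process-group leader.
nsys profile -s none -t nvtx,cuda,osrt --nic-metrics true -o ./profile/nsight_report_TP2PP4DP1NIC_node1 --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=none --session-new=my_profile_session bash ./examples/llama/train_llama3_8b_h100_2nodes_144_145.sh &
LAUNCH_PID=$!

# ... wait until the capture range has been open for the desired duration ...

# Stop the collection; this writes the .nsys-rep file but leaves the training running.
nsys stop --session=my_profile_session

# Terminate the still-running training processes (SIGTERM to the whole process group).
kill -- -"$LAUNCH_PID"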

Under the same configuration for my Megatron-LM Llama3 pre-training, I can observe one entire training GlobalStep within the first 15 seconds.

The report confirms the theory: in capture range mode, only the process invoking cudaProfilerStop() has a chance to flush its buffers; the other processes are effectively terminated forcibly and will lose some events. We will need to look into it internally (I have opened ticket DTSP-21073 in our internal tracking system). Please manually call nsys stop as a WAR for now.

I stopped the command later, and the resulting profile is missing some events, given that these two GlobalSteps should ideally be identical.

Is it possible that process 56791 hasn’t really executed those kernels by the time you stopped?
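If that turns out to be the case, simply waiting a bit longer before stopping should let all ranks finish the step; for example (a sketch; the 30-second figure is just an assumption, pick whatever comfortably covers a full GlobalStep on every rank):

# Give all ranks more time after the capture range opens, then stop the named session.
sleep 30 && nsys stop --session=my_profile_session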

Sorry for the late reply. Thanks for your help.
By manually stopping the profile session, I was able to successfully capture all 4 GPUs on my server.
