Random Freezing Trying to Profile Megatron-LM on Multiple GPUs

Right now I am trying to profile the training iteration of Megatron-LM. It works for a single GPU, but when I try to run it with 2 GPUs on the ncu CLI, it randomly freezes during initialization/setup.

If it does reach the training iteration section, it is able to successfully profile the entire application pass, but then freezes on the next application pass as shown below:

==PROF== Disconnected from process 3017557
==PROF== Disconnected from process 3017558
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING] 
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING] *****************************************
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING] *****************************************
==PROF== Connected to process 3021418 (/dtpatha/gohil01/tools/conda/envs/Megatron-LM_pyEnv/bin/python3.8)
==PROF== Connected to process 3021419 (/dtpatha/gohil01/tools/conda/envs/Megatron-LM_pyEnv/bin/python3.8)
Zarr-based strategies will not be registered because of missing packages

I also tried kernel replay, but get this error:

I’ve tried range replay, but it has unsupported APIs. Application range replay also randomly freezes during profiling

I’ve linked my profiling script here: Megatron-LM/profilingScript.sh at testProfiles · sgohil3/Megatron-LM · GitHub where I am currently calling the command below to run ncu.

./profilingScript.sh NCU_MULTI DP_TP 2 1

Any help would be appreciated,

Hi, @gohilshiv1

Sorry for the issue you met.
We’ll check details and see if we can repro internally.
By the way, which Nsight compute/driver version and GPU do you use ?

I used nvidia-smi for the GPU/Driver Version:

For Nsight Compute, I am using Version 2024.1.1.0 (build 33998838)

Hi, @gohilshiv1

We can see the randomly hang issue also during application replay. This is due to this is an indeterministic app, so kernel replay is recommended. Refer 2. Kernel Profiling Guide — NsightCompute 12.5 documentation

Regarding to kernel replay, we also tried, we have already profiled more than 2 hours, no issue met.

I see in your screenshot, you also profiled 451 kernels. So can you try to reduce the overhead of the profile by specifying some filter in command line 2. Kernel Profiling Guide — NsightCompute 12.5 documentation

I reduced the overhead to only profile the exact metrics that I am most interested at the moment with kernel replay:

--metric l1tex__m_l1tex2xbar_write_bytes,l1tex__m_xbar2l1tex_read_bytes,dram__bytes_write,dram__bytes_read,pcie__read_bytes,pcie__write_bytes,sm__sass_l1tex_m_xbar2l1tex_read_bytes_mem_global_op_ldgsts_cache_bypass,lts__t_sectors_srcunit_ltcfabric

However, it still seems to crash due to some NCCL issue:

Also why is the application replay non-deterministic on multi-GPUs, but it is deterministic on a single GPU?

Hi, @gohilshiv1

NCCL kernels are commonly mandatory concurrent, meaning that multiple kernels of the same NCCL API call (e.g. AllReduce) need to run at the same time to make forward progress.

When using either kernel or application replay, this is not possible, as individual kernels are serialized. For this purpose, you can select range or app-range replay modes.

I see you tried range replay but reported some API calls not being supported.
Maybe the range was defined too wide in the app.

Note that as of today, profiling mandatory concurrent kernels is only supported within the same process, as ncu always serialized workloads, including ranges, between processes. Therefore, if NCCL is setup to span multiple processes from the same NCCL API call, there is currently no way to profile this with ncu. The best option is to use GPU Metric Sampling in Nsight Systems in this case.

I tried reducing the range to only include a single forward pass, but it still errored out(range replay)/froze(app-range replay).

Also I was able to profile the first/second ncclKernel_AllReduce_RING_LL by splitting my profiling into multiple chunks of 340 kernels, but the third one freezes. From what I understand, I should use Nsight Systems for that third one as Nsight Compute doesn’t support it? If so, am I able to extract the cache performance metrics with NSYS?


Hi, @gohilshiv1

Regarding to Nsys usage, please raise a topic in Nsight System forum to get better support. Thanks !

Please refer the Nsight Systems document for Available metrics.