Right now I am trying to profile a training iteration of Megatron-LM. It works for a single GPU, but when I run it with 2 GPUs under the ncu CLI, it randomly freezes during initialization/setup.
If it does reach the training iteration, it successfully profiles the entire application pass, but then freezes on the next application pass, as shown below:
==PROF== Disconnected from process 3017557
==PROF== Disconnected from process 3017558
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING]
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING] *****************************************
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-21 16:23:36,106] torch.distributed.run: [WARNING] *****************************************
==PROF== Connected to process 3021418 (/dtpatha/gohil01/tools/conda/envs/Megatron-LM_pyEnv/bin/python3.8)
==PROF== Connected to process 3021419 (/dtpatha/gohil01/tools/conda/envs/Megatron-LM_pyEnv/bin/python3.8)
Zarr-based strategies will not be registered because of missing packages
Sorry for the issue you are hitting.
We'll check the details and see if we can reproduce it internally.
By the way, which Nsight Compute version, driver version, and GPU are you using?
NCCL kernels are commonly mandatory concurrent, meaning that the multiple kernels launched by a single NCCL API call (e.g. AllReduce) need to run at the same time to make forward progress.
When using either kernel or application replay, this is not possible, as individual kernels are serialized. For this reason, you can select the range or app-range replay modes instead.
I see you tried range replay but reported that some API calls were not supported.
Maybe the range defined in the app was too wide.
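For illustration, here is a minimal sketch of how the range could be narrowed to a single forward pass using cu(da)ProfilerStart/Stop markers (assuming a PyTorch-style training step; model and batch are placeholders for the actual Megatron-LM objects, and the training command in the comment is elided):

import torch

def profiled_forward_step(model, batch):
    # Hypothetical helper: only the single forward pass below falls inside the
    # range that ncu's range replay captures.
    torch.cuda.synchronize()       # finish earlier asynchronous work before the range opens
    torch.cuda.profiler.start()    # cudaProfilerStart() marks the start of the range
    output = model(batch)          # the one forward pass to profile
    torch.cuda.synchronize()       # wait for the pass to finish before closing the range
    torch.cuda.profiler.stop()     # cudaProfilerStop() marks the end of the range
    return output

# Run under range replay with something like:
#   ncu --replay-mode range --target-processes all -o fwd_range <training command>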
Note that as of today, profiling mandatory concurrent kernels is only supported within the same process, since ncu always serializes workloads, including ranges, between processes. Therefore, if a single NCCL API call is set up to span kernels across multiple processes, there is currently no way to profile it with ncu. The best option in this case is to use GPU Metric Sampling in Nsight Systems.
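In that case the application side only needs lightweight markers so the region is easy to locate in the timeline; a rough sketch (loss is a placeholder for the value produced by the training step, and the training command in the comment is elided):

import torch

def backward_with_nvtx(loss):
    # Hypothetical helper: the NVTX range labels the backward/all-reduce region
    # in the Nsight Systems timeline. GPU Metric Sampling itself is enabled on
    # the nsys command line, not in the application.
    torch.cuda.nvtx.range_push("backward_allreduce")
    loss.backward()
    torch.cuda.synchronize()
    torch.cuda.nvtx.range_pop()

# Collect with something like:
#   nsys profile --trace=cuda,nvtx --gpu-metrics-device=all -o megatron_gpumetrics <training command>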
I tried reducing the range to include only a single forward pass, but it still errored out (range replay) or froze (app-range replay).
Also, I was able to profile the first and second ncclKernel_AllReduce_RING_LL by splitting my profiling into multiple chunks of 340 kernels, but the third one freezes. From what I understand, I should use Nsight Systems for that third one, since Nsight Compute doesn't support it? If so, am I able to extract the cache performance metrics with nsys?
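For reference, a minimal sketch of this kind of chunked collection, assuming ncu's standard --launch-skip/--launch-count options; the training command and report names below are placeholders, not the exact ones I used:

import subprocess

CHUNK = 340
# Placeholder for the actual multi-GPU training command.
TRAIN_CMD = ["torchrun", "--nproc_per_node=2", "pretrain_gpt.py"]

for i in range(3):
    subprocess.run(
        [
            "ncu",
            "--target-processes", "all",
            "--launch-skip", str(i * CHUNK),   # skip kernels covered by earlier chunks
            "--launch-count", str(CHUNK),      # profile the next 340 kernel launches
            "-o", f"chunk_{i}",
            *TRAIN_CMD,
        ],
        check=True,
    )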