Nsys profiling error for distributed training of Llama 2 7B

I am currently trying to profile distributed training of a Llama 2 7B model on 3x A6000 GPUs (3 × 48 GB).

When I run the command below, training runs without any issues:

tune run --nnodes 1 --nproc_per_node 3 full_finetune_distributed --config llama2/7B_full

However, when I run profiling with nsys, I get the following error:

nsys profile -t nvtx,cuda,osrt,cudnn,cublas,mpi --sample=cpu --stats=true --cudabacktrace=all --force-overwrite=true --python-sampling-frequency=1000 --python-sampling=true --cuda-memory-usage=true --python-backtrace=cuda --gpuctxsw=true --show-output=true --export=sqlite -o /home/logs/dlprof_training_profile_llama_july19/dlprof_training_1_bsz/nsys_profile tune run --nnodes 1 --nproc_per_node 3 full_finetune_distributed --config llama2/7B_full

Error:

 INFO:torchtune.utils.logging:Model is wrapped with FSDP at : 2024-07-19 01:27:17.021160
INFO:torchtune.utils.logging:Optimizer is initialized.
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
  0%|                                                                            | 0/8667 [00:00<?, ?it/s]W0719 01:27:30.704000 140471808695552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3648 closing signal SIGTERM
E0719 01:27:33.682000 140471808695552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 3646) of binary: /root/miniconda3/envs/torchtune/bin/python
Running with torchrun...
Traceback (most recent call last):
  File "/root/miniconda3/envs/torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/research/torchtune/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/research/torchtune/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/research/torchtune/torchtune/_cli/run.py", line 177, in _run_cmd
    self._run_distributed(args)
  File "/home/research/torchtune/torchtune/_cli/run.py", line 88, in _run_distributed
    run(args)
  File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/research/torchtune/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------

Greetings,

I do not have experience with tune, so I will need to do a bit of research into it. However, when profiling parallel workloads you typically want the launcher to start nsys on each worker process, rather than running nsys on the launcher itself. See the docs for more information.
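As a rough sketch only (I have not tried this with tune, and I am assuming the full_finetune_distributed.py recipe from your traceback can be invoked directly with python and a resolvable --config path), the usual pattern is to give torchrun a small wrapper script so that each rank starts under nsys:

#!/bin/bash
# nsys_wrap.sh -- hypothetical wrapper; torchrun sets LOCAL_RANK for every worker it spawns
exec nsys profile -t nvtx,cuda,osrt,cudnn,cublas --force-overwrite=true \
  -o nsys_profile_rank${LOCAL_RANK} "$@"

# Make it executable, then launch with torchrun's --no-python so the wrapper runs first on every rank:
chmod +x nsys_wrap.sh
torchrun --nnodes 1 --nproc_per_node 3 --no-python ./nsys_wrap.sh \
  python /home/research/torchtune/recipes/full_finetune_distributed.py --config llama2/7B_full

That way each rank writes its own report (nsys_profile_rank0, _rank1, _rank2) instead of nsys having to follow the elastic launcher and its children. You may need to pass the full path of the resolved config YAML rather than the llama2/7B_full shortcut, since that resolution is normally done by tune.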

Based on your command line alone, it isn't obvious to me how that would be achieved with tune, though.

Are you able to profile tune without launching in parallel? That might be a good thing to confirm before moving forward.
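For example, something along these lines could be a reasonable first check (again just a sketch; I am not sure how tune behaves with a single process, so adjust to whatever its non-distributed path is):

nsys profile -t nvtx,cuda,osrt,cudnn,cublas --stats=true --force-overwrite=true \
  -o nsys_profile_single_rank tune run --nnodes 1 --nproc_per_node 1 full_finetune_distributed --config llama2/7B_full

If that produces a clean report, the problem is most likely the interaction between nsys and the multi-process launch rather than the profiling options themselves.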