Nsys profiling error for distributed training of Llama 2 7B

I am currently trying to profile distributed training of a Llama 2 7B model on 3x A6000 GPUs (3 × 48 GB).

When I run the command below, training runs without any issues:

tune run --nnodes 1 --nproc_per_node 3 full_finetune_distributed --config llama2/7B_full

However, when I run profiling with nsys, I get the following error:

nsys profile -t nvtx,cuda,osrt,cudnn,cublas,mpi --sample=cpu --stats=true --cudabacktrace=all --force-overwrite=true --python-sampling-frequency=1000 --python-sampling=true --cuda-memory-usage=true --python-backtrace=cuda --gpuctxsw=true --show-output=true --export=sqlite -o /home/logs/dlprof_training_profile_llama_july19/dlprof_training_1_bsz/nsys_profile tune run --nnodes 1 --nproc_per_node 3 full_finetune_distributed --config llama2/7B_full

Error:

 INFO:torchtune.utils.logging:Model is wrapped with FSDP at : 2024-07-19 01:27:17.021160
INFO:torchtune.utils.logging:Optimizer is initialized.
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
  0%|                                                                            | 0/8667 [00:00<?, ?it/s]W0719 01:27:30.704000 140471808695552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3648 closing signal SIGTERM
E0719 01:27:33.682000 140471808695552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 3646) of binary: /root/miniconda3/envs/torchtune/bin/python
Running with torchrun...
Traceback (most recent call last):
  File "/root/miniconda3/envs/torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/research/torchtune/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/research/torchtune/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/research/torchtune/torchtune/_cli/run.py", line 177, in _run_cmd
    self._run_distributed(args)
  File "/home/research/torchtune/torchtune/_cli/run.py", line 88, in _run_distributed
    run(args)
  File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/research/torchtune/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------

Greetings,

I do not have experience with tune, so I will need to do a bit of research into it. However, when profiling parallel workloads you typically want the launcher to start nsys on each worker process, rather than running nsys on the launcher itself. See the docs for more information.
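As a rough sketch only (I have not tried this with tune, and I am assuming the full_finetune_distributed.py recipe from your traceback can be invoked directly with python and a resolvable --config path), the usual pattern is to give torchrun a small wrapper script so that each rank starts under nsys:

#!/bin/bash
# nsys_wrap.sh -- hypothetical wrapper; torchrun sets LOCAL_RANK for every worker it spawns
exec nsys profile -t nvtx,cuda,osrt,cudnn,cublas --force-overwrite=true \
  -o nsys_profile_rank${LOCAL_RANK} "$@"

# Make it executable, then launch with torchrun's --no-python so the wrapper runs first on every rank:
chmod +x nsys_wrap.sh
torchrun --nnodes 1 --nproc_per_node 3 --no-python ./nsys_wrap.sh \
  python /home/research/torchtune/recipes/full_finetune_distributed.py --config llama2/7B_full

That way each rank writes its own report (nsys_profile_rank0, _rank1, _rank2) instead of nsys having to follow the elastic launcher and its children. You may need to pass the full path of the resolved config YAML rather than the llama2/7B_full shortcut, since that resolution is normally done by tune.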

Based on your command line alone, it isn't obvious to me how that would be achieved with tune, though.

Are you able to profile tune without launching in parallel? That might be a good thing to confirm before moving forward.
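For example, something along these lines could be a reasonable first check (again just a sketch; I am not sure how tune behaves with a single process, so adjust to whatever its non-distributed path is):

nsys profile -t nvtx,cuda,osrt,cudnn,cublas --stats=true --force-overwrite=true \
  -o nsys_profile_single_rank tune run --nnodes 1 --nproc_per_node 1 full_finetune_distributed --config llama2/7B_full

If that produces a clean report, the problem is most likely the interaction between nsys and the multi-process launch rather than the profiling options themselves.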