I am currently trying to profile distributed training of a LLAMA 2 7B model on 3x A6000 GPUs (48 GB each).
When I run the command below, the training runs without any issues:
tune run --nnodes 1 --nproc_per_node 3 full_finetune_distributed --config llama2/7B_full
However, when I profile the same run with Nsight Systems (nsys), I get the following error:
nsys profile -t nvtx,cuda,osrt,cudnn,cublas,mpi \
  --sample=cpu --stats=true --cudabacktrace=all --force-overwrite=true \
  --python-sampling-frequency=1000 --python-sampling=true \
  --cuda-memory-usage=true --python-backtrace=cuda --gpuctxsw=true \
  --show-output=true --export=sqlite \
  -o /home/logs/dlprof_training_profile_llama_july19/dlprof_training_1_bsz/nsys_profile \
  tune run --nnodes 1 --nproc_per_node 3 full_finetune_distributed --config llama2/7B_full
Error:
INFO:torchtune.utils.logging:Model is wrapped with FSDP at : 2024-07-19 01:27:17.021160
INFO:torchtune.utils.logging:Optimizer is initialized.
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
0%| | 0/8667 [00:00<?, ?it/s]
W0719 01:27:30.704000 140471808695552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3648 closing signal SIGTERM
E0719 01:27:33.682000 140471808695552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 3646) of binary: /root/miniconda3/envs/torchtune/bin/python
Running with torchrun...
Traceback (most recent call last):
File "/root/miniconda3/envs/torchtune/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/research/torchtune/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/home/research/torchtune/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/home/research/torchtune/torchtune/_cli/run.py", line 177, in _run_cmd
self._run_distributed(args)
File "/home/research/torchtune/torchtune/_cli/run.py", line 88, in _run_distributed
run(args)
File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/torchtune/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/research/torchtune/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------