==WARNING== Please consult the documentation for current range replay limitations and requirements.
[2023-01-22 17:06:23,748] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-01-22 17:06:23,800] [INFO] [runner.py:508:main] cmd = /home/sihwa/anaconda3/envs/deep/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ms.py --model_type gpt2 --model_name_or_path models/custom --length 13 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --context 10 --num_sample_to_iter 1 --per_device_eval_batch_size 1 --custom_model
==PROF== Target process 97523 terminated before first instrumented API call.
==PROF== Target process 97524 terminated before first instrumented API call.
[2023-01-22 17:06:26,065] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-01-22 17:06:26,065] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-01-22 17:06:26,065] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-01-22 17:06:26,065] [INFO] [launch.py:162:main] dist_world_size=4
[2023-01-22 17:06:26,065] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
==PROF== Target process 97595 terminated before first instrumented API call.
==PROF== Target process 97596 terminated before first instrumented API call.
==PROF== Connected to process 97606 (/home/sihwa/anaconda3/envs/deep/bin/python3.9)
==PROF== Connected to process 97607 (/home/sihwa/anaconda3/envs/deep/bin/python3.9)
==PROF== Connected to process 97608 (/home/sihwa/anaconda3/envs/deep/bin/python3.9)
==PROF== Connected to process 97609 (/home/sihwa/anaconda3/envs/deep/bin/python3.9)
==PROF== Target process 97862 terminated before first instrumented API call.
==PROF== Target process 97863 terminated before first instrumented API call.
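The `fetch_hostfile` warning above is expected for a single-node run: DeepSpeed found no hostfile and fell back to the local GPUs (here, four). For a multi-node launch, DeepSpeed reads an OpenMPI-style hostfile listing each node and its GPU slot count; a minimal sketch follows (the hostnames are placeholders, not from this log):

```text
# hostfile: one "<hostname> slots=<num_gpus>" entry per node
worker-1 slots=4
worker-2 slots=4
```

The file is passed via `deepspeed --hostfile=<path> …`; without that flag, DeepSpeed looks for a default hostfile and, failing that, emits exactly the warning seen here and uses local resources only.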
==PROF== Target process 97864 terminated before first instrumented API call.
==PROF== Target process 97865 terminated before first instrumented API call.
==PROF== Target process 97866 terminated before first instrumented API call.
==PROF== Target process 97869 terminated before first instrumented API call.
==PROF== Target process 97870 terminated before first instrumented API call.
==PROF== Target process 97872 terminated before first instrumented API call.
[2023-01-22 17:07:59,686] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.8.0+7e2103f8, git-hash=7e2103f8, git-branch=master
[2023-01-22 17:07:59,909] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.8.0+7e2103f8, git-hash=7e2103f8, git-branch=master
[2023-01-22 17:08:00,326] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.8.0+7e2103f8, git-hash=7e2103f8, git-branch=master
[2023-01-22 17:08:00,326] [INFO] [comm.py:655:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-01-22 17:08:00,464] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.8.0+7e2103f8, git-hash=7e2103f8, git-branch=master
[2023-01-22 17:08:17,307] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-01-22 17:08:17,307] [INFO] [logging.py:68:log_dist] [Rank 0] Creating ZeRO Offload
[2023-01-22 17:08:17,403] [INFO] [utils.py:827:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-01-22 17:08:17,404] [INFO] [utils.py:828:see_memory_usage] MA 15.85 GB Max_MA 15.85 GB CA 15.85 GB Max_CA 16 GB
[2023-01-22 17:08:17,404] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 186.58 GB, percent = 18.5%
Parameter Offload: Total persistent parameters: 0 in 0 params
[2023-01-22 17:08:17,519] [INFO] [utils.py:827:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-01-22 17:08:17,520] [INFO] [utils.py:828:see_memory_usage] MA 3.96 GB Max_MA 16.43 GB CA 19.81 GB Max_CA 20 GB
[2023-01-22 17:08:17,520] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 186.52 GB, percent = 18.5%
[2023-01-22 17:08:17,520] [INFO] [config.py:1008:print] DeepSpeedEngine configuration:
[2023-01-22 17:08:17,520] [INFO] [config.py:1012:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-01-22 17:08:17,520] [INFO] [config.py:1012:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-01-22 17:08:17,520] [INFO] [config.py:1012:print] amp_enabled .................. False
[2023-01-22 17:08:17,520] [INFO] [config.py:1012:print] amp_params ................... False
[2023-01-22 17:08:17,520] [INFO] [config.py:1012:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] bfloat16_enabled ............. False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] checkpoint_parallel_write_pipeline False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] checkpoint_tag_validation_enabled True
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] checkpoint_tag_validation_fail False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] comms_config .................
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] communication_data_type ...... None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] curriculum_enabled_legacy .... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] curriculum_params_legacy ..... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] data_efficiency_enabled ...... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] dataloader_drop_last ......... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] disable_allgather ............ False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] dump_state ................... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] dynamic_loss_scale_args ...... None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_enabled ........... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_gas_boundary_resolution 1
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_layer_num ......... 0
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_max_iter .......... 100
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_stability ......... 1e-06
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_tol ............... 0.01
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] eigenvalue_verbose ........... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] elasticity_enabled ........... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] fp16_auto_cast ............... None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] fp16_enabled ................. False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] fp16_master_weights_and_gradients False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] global_rank .................. 0
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] grad_accum_dtype ............. None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] gradient_accumulation_steps .. 1
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] gradient_clipping ............ 0.0
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] gradient_predivide_factor .... 1.0
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] initial_dynamic_scale ........ 4294967296
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] load_universal_checkpoint .... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] loss_scale ................... 0
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] memory_breakdown ............. False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] monitor_config ...............
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] optimizer_legacy_fusion ...... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] optimizer_name ............... None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] optimizer_params ............. None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] pld_enabled .................. False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] pld_params ................... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] prescale_gradients ........... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] scheduler_name ............... None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] scheduler_params ............. None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] sparse_attention ............. None
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] sparse_gradients_enabled ..... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] steps_per_print .............. 2000
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] train_batch_size ............. 4
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] train_micro_batch_size_per_gpu 1
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] use_node_local_storage ....... False
[2023-01-22 17:08:17,521] [INFO] [config.py:1012:print] wall_clock_breakdown ......... False
[2023-01-22 17:08:17,522] [INFO] [config.py:1012:print] world_size ................... 4
[2023-01-22 17:08:17,522] [INFO] [config.py:1012:print] zero_allow_untested_optimizer False
[2023-01-22 17:08:17,522] [INFO] [config.py:1012:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=150994944 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=135895449 param_persistence_threshold=0 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-01-22 17:08:17,522] [INFO] [config.py:1012:print] zero_enabled ................. True
[2023-01-22 17:08:17,522] [INFO] [config.py:1012:print] zero_optimization_stage ...... 3
[2023-01-22 17:08:17,522] [INFO] [config.py:997:print_user_config] json = { "fp16": { "enabled": false }, "bf16": { "enabled": false }, "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "reduce_bucket_size": 1.509949e+08, "stage3_prefetch_bucket_size": 1.358954e+08, "stage3_param_persistence_threshold": 0 }, "steps_per_print": 2.000000e+03, "train_batch_size": 4, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false }
0 step => tensor([[ 796, 8074, 20272, 9106, 3876, 385, 796, 220, 198, 8074, 28043, 38623, 38032]], device='cuda:2')
0 step => tensor([[ 796, 8074, 20272, 9106, 3876, 385, 796, 220, 198, 8074, 28043, 38623, 38032]], device='cuda:0')
0 step => tensor([[ 796, 8074, 20272, 9106, 3876, 385, 796, 220, 198, 8074, 28043, 38623, 38032]], device='cuda:3')
0 step => tensor([[ 796, 8074, 20272, 9106, 3876, 385, 796, 220, 198, 8074, 28043, 38623, 38032]], device='cuda:1')
==PROF== Disconnected from process 97608
==PROF== Disconnected from process 97606
==PROF== Disconnected from process 97609
==PROF== Disconnected from process 97607
[2023-01-22 17:08:38,281] [INFO] [launch.py:350:main] Process 97608 exits successfully.
[2023-01-22 17:08:38,282] [INFO] [launch.py:350:main] Process 97607 exits successfully.
[2023-01-22 17:08:38,282] [INFO] [launch.py:350:main] Process 97606 exits successfully.
[2023-01-22 17:08:38,282] [INFO] [launch.py:350:main] Process 97609 exits successfully.
==PROF== Target process 97525 terminated before first instrumented API call.
==WARNING== No ranges were profiled.
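The `print_user_config` dump above corresponds to a DeepSpeed config file along these lines (the filename `ds_config.json` is an assumption; the scientific-notation values are written out in full, matching the `reduce_bucket_size=150994944` and `prefetch_bucket_size=135895449` values echoed in `zero_config`):

```json
{
  "fp16": { "enabled": false },
  "bf16": { "enabled": false },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 150994944,
    "stage3_prefetch_bucket_size": 135895449,
    "stage3_param_persistence_threshold": 0
  },
  "steps_per_print": 2000,
  "train_batch_size": 4,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
```

Note that the batch sizes are consistent with DeepSpeed's invariant train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size, i.e. 4 = 1 × 1 × 4 for this four-GPU run.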