Hi, I’m trying to train my own model with multiple GPUs, but a TimeoutError occurs.
System information
• Hardware: 4x RTX 3090
• Network Type: PointPillar
Here are the results of tao info:
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023
Here’s the output of tao pointpillars train -e $SPECS_DIR/pointpillars_cm.yaml -r $USER_EXPERIMENT_DIR -k $KEY --gpus 4:
2023-06-02 11:52:37,440 [INFO] root: Registry: ['nvcr.io']
2023-06-02 11:52:37,481 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
2023-06-02 11:52:37,503 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ailab/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
python -m torch.distributed.launch --nproc_per_node=4 /opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py --cfg_file /workspace/tao-experiments/specs/pointpillars_cm.yaml --output_dir /workspace/tao-experiments/pointpillars --key tlt_encode --gpus 4
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py>", line 3, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 152, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 58, in main
  File "<frozen pointcloud.pointpillars.pcdet.utils.common_utils>", line 156, in init_dist_pytorch
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
(All four worker processes print this same socket error and traceback; their output is interleaved in the raw log.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 484) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 485)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 486)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 487)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 484)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2023-06-02 12:22:43,135 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
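From the traceback, every rank dies inside TCPStore(hostname, port, world_size, False, timeout) while trying to reach (127.0.0.1, 18888), i.e. the c10d rendezvous store never becomes reachable. To rule out another process already holding that port on the host, a small check like this can be run (a debugging sketch of my own, not part of TAO; 18888 is the TCP_PORT value from the spec below):

# port_check.py -- quick sanity check (mine, not TAO's):
# can anything listen on 127.0.0.1:18888, the rendezvous port?
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a listening socket can be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            s.listen(1)
            return True
        except OSError:
            return False

print("127.0.0.1:18888 free:", port_is_free(18888))

If this prints False, something else (for example a stale training container) is still bound to the port.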
And here’s my pointpillars_cm.yaml:
CLASS_NAMES: ['Car']
DATA_CONFIG:
    DATASET: 'GeneralPCDataset'
    DATA_PATH: '/workspace/tao-experiments/data'
    DATA_SPLIT: {
        'train': train,
        'test': val
    }
    INFO_PATH: {
        'train': [infos_train.pkl],
        'test': [infos_val.pkl],
    }
    BALANCED_RESAMPLING: False
    POINT_FEATURE_ENCODING: {
        encoding_type: absolute_coordinates_encoding,
        used_feature_list: ['x', 'y', 'z', 'intensity'],
        src_feature_list: ['x', 'y', 'z', 'intensity'],
    }
    POINT_CLOUD_RANGE: [-69.12, -39.68, -3, 69.12, 39.68, 1]
    DATA_AUGMENTOR:
        DISABLE_AUG_LIST: ['placeholder']
        AUG_CONFIG_LIST:
            - NAME: gt_sampling
              DB_INFO_PATH:
                  - dbinfos_train.pkl
              PREPARE: {
                  filter_by_min_points: ['Car:5'] #, 'Pedestrian:5', 'Cyclist:5'],
              }
              SAMPLE_GROUPS: ['Car:15'] #,'Pedestrian:15', 'Cyclist:15']
              NUM_POINT_FEATURES: 4
              DATABASE_WITH_FAKELIDAR: False
              REMOVE_EXTRA_WIDTH: [0.0, 0.0, 0.0]
              LIMIT_WHOLE_SCENE: False
            - NAME: random_world_flip
              ALONG_AXIS_LIST: ['x']
            - NAME: random_world_rotation
              WORLD_ROT_ANGLE: [-0.78539816, 0.78539816]
            - NAME: random_world_scaling
              WORLD_SCALE_RANGE: [0.95, 1.05]
    DATA_PROCESSOR:
        - NAME: mask_points_and_boxes_outside_range
          REMOVE_OUTSIDE_BOXES: True
        - NAME: shuffle_points
          SHUFFLE_ENABLED: {
              'train': True,
              'test': False
          }
        - NAME: transform_points_to_voxels
          VOXEL_SIZE: [0.16, 0.16, 4]
          MAX_POINTS_PER_VOXEL: 32
          MAX_NUMBER_OF_VOXELS: {
              'train': 16000,
              'test': 10000
          }
    NUM_WORKERS: 4
MODEL:
    NAME: PointPillar
    VFE:
        NAME: PillarVFE
        WITH_DISTANCE: False
        USE_ABSLOTE_XYZ: True
        USE_NORM: True
        NUM_FILTERS: [64]
    MAP_TO_BEV:
        NAME: PointPillarScatter
        NUM_BEV_FEATURES: 64
    BACKBONE_2D:
        NAME: BaseBEVBackbone
        LAYER_NUMS: [3, 5, 5]
        LAYER_STRIDES: [2, 2, 2]
        NUM_FILTERS: [64, 128, 256]
        UPSAMPLE_STRIDES: [1, 2, 4]
        NUM_UPSAMPLE_FILTERS: [128, 128, 128]
    DENSE_HEAD:
        NAME: AnchorHeadSingle
        CLASS_AGNOSTIC: False
        USE_DIRECTION_CLASSIFIER: True
        DIR_OFFSET: 0.78539
        DIR_LIMIT_OFFSET: 0.0
        NUM_DIR_BINS: 2
        ANCHOR_GENERATOR_CONFIG: [
            {
                'class_name': 'Car',
                'anchor_sizes': [[4.64, 1.90, 1.38]],
                'anchor_rotations': [-1.57, 1.57],
                'anchor_bottom_heights': [-1.78],
                'align_center': False,
                'feature_map_stride': 2,
                'matched_threshold': 0.6,
                'unmatched_threshold': 0.45
            }
            # {
            #     'class_name': 'Pedestrian',
            #     'anchor_sizes': [[0.8, 0.6, 1.73]],
            #     'anchor_rotations': [0, 1.57],
            #     'anchor_bottom_heights': [-0.6],
            #     'align_center': False,
            #     'feature_map_stride': 2,
            #     'matched_threshold': 0.5,
            #     'unmatched_threshold': 0.35
            # },
            # {
            #     'class_name': 'Cyclist',
            #     'anchor_sizes': [[1.76, 0.6, 1.73]],
            #     'anchor_rotations': [0, 1.57],
            #     'anchor_bottom_heights': [-0.6],
            #     'align_center': False,
            #     'feature_map_stride': 2,
            #     'matched_threshold': 0.5,
            #     'unmatched_threshold': 0.35
            # }
        ]
        TARGET_ASSIGNER_CONFIG:
            NAME: AxisAlignedTargetAssigner
            POS_FRACTION: -1.0
            SAMPLE_SIZE: 512
            NORM_BY_NUM_EXAMPLES: False
            MATCH_HEIGHT: False
            BOX_CODER: ResidualCoder
        LOSS_CONFIG:
            LOSS_WEIGHTS: {
                'cls_weight': 1.0,
                'loc_weight': 2.0,
                'dir_weight': 0.2,
                'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
            }
    POST_PROCESSING:
        RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
        SCORE_THRESH: 0.1
        OUTPUT_RAW_SCORE: False
        EVAL_METRIC: kitti
        NMS_CONFIG:
            MULTI_CLASSES_NMS: False
            NMS_TYPE: nms_gpu
            NMS_THRESH: 0.01
            NMS_PRE_MAXSIZE: 4096
            NMS_POST_MAXSIZE: 500
    SYNC_BN: False
OPTIMIZATION:
    BATCH_SIZE_PER_GPU: 8
    NUM_EPOCHS: 40
    OPTIMIZER: adam_onecycle
    LR: 0.003
    WEIGHT_DECAY: 0.01
    MOMENTUM: 0.9
    MOMS: [0.95, 0.85]
    PCT_START: 0.4
    DIV_FACTOR: 10
    DECAY_STEP_LIST: [35, 45]
    LR_DECAY: 0.1
    LR_CLIP: 0.0000001
    LR_WARMUP: False
    WARMUP_EPOCH: 1
    GRAD_NORM_CLIP: 10
    RESUME_MODEL_PATH: null
    PRETRAINED_MODEL_PATH: null
    PRUNED_MODEL_PATH: null
    TCP_PORT: 18888
    RANDOM_SEED: null
    CKPT_INTERVAL: 1
    MAX_CKPT_SAVE_NUM: 30
    MERGE_ALL_ITERS_TO_ONE_EPOCH: False
EVALUATION:
    BATCH_SIZE: 1
    CKPT: "/workspace/tao-experiments/pointpillars/ckpt/checkpoint_epoch_40.tlt"
INFERENCE:
    MAX_POINTS_NUM: 25000
    BATCH_SIZE: 1
    CKPT: "/workspace/tao-experiments/pointpillars/ckpt/checkpoint_epoch_40.tlt"
    VIS_CONF_THRESH: 0.1
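For context, my understanding is that init_dist_pytorch (the frame at line 156 in the traceback) boils down to a single-node TCP rendezvous on TCP_PORT, roughly like the sketch below (my reconstruction from the traceback, not TAO's actual train.py code):

# Rough sketch of the rendezvous the traceback points at
# (my reconstruction, not TAO's actual code).
import os
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set per process by the launcher
dist.init_process_group(
    backend="nccl",                       # assumption: NCCL on the 4x RTX 3090
    init_method="tcp://127.0.0.1:18888",  # TCP_PORT: 18888 from OPTIMIZATION above
    rank=local_rank,
    world_size=4,                         # --gpus 4
)

What puzzles me is that all four ranks, including rank 0, hit the same 1800s timeout, and the TCPStore call in the traceback passes False for is_master, so every rank connects as a client and the store apparently never comes up. Has anyone seen init_process_group fail like this on a single machine, or know what could block connections to 127.0.0.1:18888 inside the container?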