Hi, I’m trying to train my own model with multiple GPUs, but a TimeoutError occurs.
System information
• Hardware: 4x RTX 3090
• Network Type: PointPillar
Here are the results of tao info:
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023
Here’s the output of tao pointpillars train -e $SPECS_DIR/pointpillars_cm.yaml -r $USER_EXPERIMENT_DIR -k $KEY --gpus 4:
2023-06-02 11:52:37,440 [INFO] root: Registry: ['nvcr.io']
2023-06-02 11:52:37,481 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
2023-06-02 11:52:37,503 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ailab/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
python -m torch.distributed.launch --nproc_per_node=4 /opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py --cfg_file /workspace/tao-experiments/specs/pointpillars_cm.yaml --output_dir /workspace/tao-experiments/pointpillars --key tlt_encode --gpus 4
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py>", line 3, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 152, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 58, in main
  File "<frozen pointcloud.pointpillars.pcdet.utils.common_utils>", line 156, in init_dist_pytorch
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
(All four worker processes print this same socket error and traceback; their output is interleaved in the raw log.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 484) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 485)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 486)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 487)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-02_03:22:42
host : 996ce2b3a184
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 484)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2023-06-02 12:22:43,135 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
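From the traceback, every rank dies inside TCPStore(hostname, port, world_size, False, timeout) while trying to reach (127.0.0.1, 18888), i.e. the c10d rendezvous store never becomes reachable. To rule out another process already holding that port on the host, a small check like this can be run (a debugging sketch of my own, not part of TAO; 18888 is the TCP_PORT value from the spec below):

# port_check.py -- quick sanity check (mine, not TAO's):
# can anything listen on 127.0.0.1:18888, the rendezvous port?
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a listening socket can be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            s.listen(1)
            return True
        except OSError:
            return False

print("127.0.0.1:18888 free:", port_is_free(18888))

If this prints False, something else (for example a stale training container) is still bound to the port.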
And here’s my pointpillars_cm.yaml:
CLASS_NAMES: ['Car']
DATA_CONFIG:
    DATASET: 'GeneralPCDataset'
    DATA_PATH: '/workspace/tao-experiments/data'
    DATA_SPLIT: {
        'train': train,
        'test': val
    }
    INFO_PATH: {
        'train': [infos_train.pkl],
        'test': [infos_val.pkl],
    }
    BALANCED_RESAMPLING: False
    POINT_FEATURE_ENCODING: {
        encoding_type: absolute_coordinates_encoding,
        used_feature_list: ['x', 'y', 'z', 'intensity'],
        src_feature_list: ['x', 'y', 'z', 'intensity'],
    }
    POINT_CLOUD_RANGE: [-69.12, -39.68, -3, 69.12, 39.68, 1]
    DATA_AUGMENTOR:
        DISABLE_AUG_LIST: ['placeholder']
        AUG_CONFIG_LIST:
            - NAME: gt_sampling
              DB_INFO_PATH:
                  - dbinfos_train.pkl
              PREPARE: {
                  filter_by_min_points: ['Car:5'] #, 'Pedestrian:5', 'Cyclist:5'],
              }
              SAMPLE_GROUPS: ['Car:15'] #,'Pedestrian:15', 'Cyclist:15']
              NUM_POINT_FEATURES: 4
              DATABASE_WITH_FAKELIDAR: False
              REMOVE_EXTRA_WIDTH: [0.0, 0.0, 0.0]
              LIMIT_WHOLE_SCENE: False
            - NAME: random_world_flip
              ALONG_AXIS_LIST: ['x']
            - NAME: random_world_rotation
              WORLD_ROT_ANGLE: [-0.78539816, 0.78539816]
            - NAME: random_world_scaling
              WORLD_SCALE_RANGE: [0.95, 1.05]
    DATA_PROCESSOR:
        - NAME: mask_points_and_boxes_outside_range
          REMOVE_OUTSIDE_BOXES: True
        - NAME: shuffle_points
          SHUFFLE_ENABLED: {
              'train': True,
              'test': False
          }
        - NAME: transform_points_to_voxels
          VOXEL_SIZE: [0.16, 0.16, 4]
          MAX_POINTS_PER_VOXEL: 32
          MAX_NUMBER_OF_VOXELS: {
              'train': 16000,
              'test': 10000
          }
    NUM_WORKERS: 4
MODEL:
    NAME: PointPillar
    VFE:
        NAME: PillarVFE
        WITH_DISTANCE: False
        USE_ABSLOTE_XYZ: True
        USE_NORM: True
        NUM_FILTERS: [64]
    MAP_TO_BEV:
        NAME: PointPillarScatter
        NUM_BEV_FEATURES: 64
    BACKBONE_2D:
        NAME: BaseBEVBackbone
        LAYER_NUMS: [3, 5, 5]
        LAYER_STRIDES: [2, 2, 2]
        NUM_FILTERS: [64, 128, 256]
        UPSAMPLE_STRIDES: [1, 2, 4]
        NUM_UPSAMPLE_FILTERS: [128, 128, 128]
    DENSE_HEAD:
        NAME: AnchorHeadSingle
        CLASS_AGNOSTIC: False
        USE_DIRECTION_CLASSIFIER: True
        DIR_OFFSET: 0.78539
        DIR_LIMIT_OFFSET: 0.0
        NUM_DIR_BINS: 2
        ANCHOR_GENERATOR_CONFIG: [
            {
                'class_name': 'Car',
                'anchor_sizes': [[4.64, 1.90, 1.38]],
                'anchor_rotations': [-1.57, 1.57],
                'anchor_bottom_heights': [-1.78],
                'align_center': False,
                'feature_map_stride': 2,
                'matched_threshold': 0.6,
                'unmatched_threshold': 0.45
            }
            # {
            #     'class_name': 'Pedestrian',
            #     'anchor_sizes': [[0.8, 0.6, 1.73]],
            #     'anchor_rotations': [0, 1.57],
            #     'anchor_bottom_heights': [-0.6],
            #     'align_center': False,
            #     'feature_map_stride': 2,
            #     'matched_threshold': 0.5,
            #     'unmatched_threshold': 0.35
            # },
            # {
            #     'class_name': 'Cyclist',
            #     'anchor_sizes': [[1.76, 0.6, 1.73]],
            #     'anchor_rotations': [0, 1.57],
            #     'anchor_bottom_heights': [-0.6],
            #     'align_center': False,
            #     'feature_map_stride': 2,
            #     'matched_threshold': 0.5,
            #     'unmatched_threshold': 0.35
            # }
        ]
        TARGET_ASSIGNER_CONFIG:
            NAME: AxisAlignedTargetAssigner
            POS_FRACTION: -1.0
            SAMPLE_SIZE: 512
            NORM_BY_NUM_EXAMPLES: False
            MATCH_HEIGHT: False
            BOX_CODER: ResidualCoder
        LOSS_CONFIG:
            LOSS_WEIGHTS: {
                'cls_weight': 1.0,
                'loc_weight': 2.0,
                'dir_weight': 0.2,
                'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
            }
    POST_PROCESSING:
        RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
        SCORE_THRESH: 0.1
        OUTPUT_RAW_SCORE: False
        EVAL_METRIC: kitti
        NMS_CONFIG:
            MULTI_CLASSES_NMS: False
            NMS_TYPE: nms_gpu
            NMS_THRESH: 0.01
            NMS_PRE_MAXSIZE: 4096
            NMS_POST_MAXSIZE: 500
    SYNC_BN: False
OPTIMIZATION:
    BATCH_SIZE_PER_GPU: 8
    NUM_EPOCHS: 40
    OPTIMIZER: adam_onecycle
    LR: 0.003
    WEIGHT_DECAY: 0.01
    MOMENTUM: 0.9
    MOMS: [0.95, 0.85]
    PCT_START: 0.4
    DIV_FACTOR: 10
    DECAY_STEP_LIST: [35, 45]
    LR_DECAY: 0.1
    LR_CLIP: 0.0000001
    LR_WARMUP: False
    WARMUP_EPOCH: 1
    GRAD_NORM_CLIP: 10
    RESUME_MODEL_PATH: null
    PRETRAINED_MODEL_PATH: null
    PRUNED_MODEL_PATH: null
    TCP_PORT: 18888
    RANDOM_SEED: null
    CKPT_INTERVAL: 1
    MAX_CKPT_SAVE_NUM: 30
    MERGE_ALL_ITERS_TO_ONE_EPOCH: False
EVALUATION:
    BATCH_SIZE: 1
    CKPT: "/workspace/tao-experiments/pointpillars/ckpt/checkpoint_epoch_40.tlt"
INFERENCE:
    MAX_POINTS_NUM: 25000
    BATCH_SIZE: 1
    CKPT: "/workspace/tao-experiments/pointpillars/ckpt/checkpoint_epoch_40.tlt"
    VIS_CONF_THRESH: 0.1
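For context, my understanding is that init_dist_pytorch (the frame at line 156 in the traceback) boils down to a single-node TCP rendezvous on TCP_PORT, roughly like the sketch below (my reconstruction from the traceback, not TAO's actual train.py code):

# Rough sketch of the rendezvous the traceback points at
# (my reconstruction, not TAO's actual code).
import os
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set per process by the launcher
dist.init_process_group(
    backend="nccl",                       # assumption: NCCL on the 4x RTX 3090
    init_method="tcp://127.0.0.1:18888",  # TCP_PORT: 18888 from OPTIMIZATION above
    rank=local_rank,
    world_size=4,                         # --gpus 4
)

What puzzles me is that all four ranks, including rank 0, hit the same 1800s timeout, and the TCPStore call in the traceback passes False for is_master, so every rank connects as a client and the store apparently never comes up. Has anyone seen init_process_group fail like this on a single machine, or know what could block connections to 127.0.0.1:18888 inside the container?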