Classification_pyt error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
classification_pyt
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)

train:
  exp_config:
    manual_seed: 49
  train_config:
    runner:
      max_epochs: 40
    checkpoint_config:
      interval: 1
    logging:
      interval: 500
    validate: True
    evaluation:
      interval: 1
    custom_hooks:
      - type: "EMAHook"
        momentum: 4e-5
        priority: "ABOVE_NORMAL"
dataset:
  data:
    samples_per_gpu: 8
    train:
      data_prefix: /data/cats_dogs_dataset/training_set/training_set/
      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          size: 224
        - type: RandomFlip
          flip_prob: 0.5
          direction: "horizontal"
      classes: /data/cats_dogs_dataset/classes.txt
    val:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
    test:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
model:
  backbone:
    type: "fan_tiny_8_p4_hybrid"
    custom_args:
      drop_path: 0.1
  head:
    type: "FANLinearClsHead"
    custom_args:
      head_init_scale: 1
    num_classes: 2
    loss:
      type: "CrossEntropyLoss"
      loss_weight: 1.0
      use_soft: False

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

env: EPOCHS=5
Train Classification Model
2024-09-12 11:01:56,227 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-09-12 11:01:56,351 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-09-12 11:01:56,494 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-09-12 11:01:56,494 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-09-12 03:02:02,955 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
train.py: error: unrecognized arguments: -g 1
E0912 03:02:15.628000 139649349068608 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 2) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:  time      : 2024-09-12_03:02:15
  host      : a70d5fe5d884
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 541)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 11} to https://api.tao.ngc.nvidia.com.
[2024-09-12 03:02:17,422 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 03:02:17,423 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 03:02:17,423 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 11:02:18,346 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

This is the error I am facing.

Can you share the command line? It seems there is an unrecognized argument.
You can take a look at the notebook tao_tutorials/notebooks/tao_launcher_starter_kit/dino/dino.ipynb at main · NVIDIA/tao_tutorials · GitHub as a reference.

%env EPOCHS = 5

print("Train Classification Model")
!tao model classification_pyt train \
                  -e /workspace/tao-experiments/specs/train_cats_dogs.yaml \
                  -r $RESULTS_DIR/classification_experiment \
                  -g $NUM_GPUS \
                  train.train_config.runner.max_epochs=$EPOCHS

The -g option is not expected. Please refer to https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/image_classification_pyt.html#training-the-model or the notebook, and set train.num_gpus instead.

!tao model classification_pyt train \
                  -e /workspace/tao-experiments/specs/train_cats_dogs.yaml \
                  train.num_gpus=1 \
                  results_dir=/workspace/tao-experiments/result/classification_experiment \
                  train.train_config.runner.max_epochs=5

This is my latest command line; it produces the same error.

Hi, I don't think this issue has been fixed.

May I know which TAO docker you are using?
Please refer to the spec file tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt/specs/train_cats_dogs.yaml at main · NVIDIA/tao_tutorials · GitHub.

Please share the latest error log. Thanks.

Error executing job with overrides: ['results_dir=/workspace/tao-experiments/result/classification_experiment', 'train.train_config.runner.max_epochs=5', 'train.gpu_ids=[0]', 'train.num_gpus=1']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
    run_experiment(cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 74, in run_experiment
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1527, in build_train_loop
    loop = EpochBasedTrainLoop(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 44, in __init__
    super().__init__(runner, dataloader)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/datasets/imagenet.py", line 122, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/datasets/custom.py", line 207, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/datasets/base_dataset.py", line 97, in __init__
    transforms.append(TRANSFORMS.build(transform))
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
TypeError: RandomResizedCrop.__init__() got an unexpected keyword argument 'size'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E0912 06:38:01.120000 140293964285760 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:  time      : 2024-09-12_06:38:01
  host      : 71eb16616be0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 541)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 06:38:01,350 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 06:38:01,350 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 06:38:01,351 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 16} to https://api.tao.ngc.nvidia.com.
[2024-09-12 06:38:02,933 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 06:38:02,934 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 06:38:02,935 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 14:38:03,972 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

The latest error log shows the same failure, and there is another error on top of it. Furthermore, I think the Docker version is shown in the log.

You are using the 5.5 docker.
Your spec file sets an unexpected “size” argument.
Please follow the doc or the notebook spec tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt/specs/train_cats_dogs.yaml at main · NVIDIA/tao_tutorials · GitHub.
It should be “scale” instead of “size”.
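For reference, the corrected augmentation block looks like this (a minimal sketch mirroring the notebook spec; note that in this mmpretrain version RandomFlip likewise takes prob rather than flip_prob):

      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          scale: 224
        - type: RandomFlip
          prob: 0.5
          direction: "horizontal"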

Great news: I could train with the default dataset. However, I am now trying to train with my own dataset, and I followed the same file structure as the default dataset.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertion

I am facing this error when I try to train; the command line is the same.
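As the message suggests, CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the assert is reported at the real call site. With the TAO launcher, one way to pass it into the container is an Envs entry in ~/.tao_mounts.json next to your existing Mounts list (a sketch; it assumes your launcher version supports the Envs key):

"Envs": [
    {
        "variable": "CUDA_LAUNCH_BLOCKING",
        "value": "1"
    }
]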

train:
  exp_config:
    manual_seed: 49
  train_config:
    runner:
      max_epochs: 40
    checkpoint_config:
      interval: 1
    logging:
      interval: 500
    validate: True
    evaluation:
      interval: 1
    custom_hooks:
      - type: "EMAHook"
        momentum: 4e-5
        priority: "ABOVE_NORMAL"
dataset:
  data:
    samples_per_gpu: 8
    train:
      data_prefix: /workspace/tao-experiments/data/car_color/training_set
      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          scale: 224
        - type: RandomFlip
          prob: 0.5
          direction: "horizontal"
      classes:  /workspace/tao-experiments/data/car_color/classes.txt
    val:
      data_prefix:  /workspace/tao-experiments/data/car_color/val_set
      classes:  /workspace/tao-experiments/data/car_color/classes.txt
    # test:
    #   data_prefix:  /workspace/tao-experiments/data/car_color/val_set/val_set
    #   classes:  /workspace/tao-experiments/data/car_color/classes.txt
model:
  backbone:
    type: "fan_tiny_8_p4_hybrid"
    custom_args:
      drop_path: 0.1
  head:
    type: "FANLinearClsHead"
    custom_args:
      head_init_scale: 1
    num_classes: 2
    loss:
      type: "CrossEntropyLoss"
      loss_weight: 1.0
      use_soft: False

I did not modify much in my YAML file.

env: EPOCHS=5
Train Classification Model
2024-09-12 18:07:38,258 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-09-12 18:07:38,391 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-09-12 18:07:38,535 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-09-12 18:07:38,535 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-09-12 10:07:45,141 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
Train results will be saved at: /workspace/tao-experiments/result/car_color/train
09/12 10:07:56 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 49
    GPU 0: NVIDIA RTX A6000
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.131
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.3.0a0+6ddf5cf85e.nv24.04
    PyTorch compiling details: PyTorch built with:
  - GCC 11.2
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash N/A)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_87,code=sm_87;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
  - CuDNN 90.1
  - Magma 2.6.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

    TorchVision: 0.18.0a0
    OpenCV: 4.7.0
    MMEngine: 0.10.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 49
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 1
------------------------------------------------------------

09/12 10:07:56 - mmengine - INFO - Config:
auto_scale_lr = dict(base_batch_size=1024)
custom_hooks = [
    dict(momentum=4e-05, priority='ABOVE_NORMAL', type='EMAHook'),
]
data_preprocessor = dict(
    mean=[
        123.675,
        116.28,
        103.53,
    ],
    num_classes=2,
    std=[
        58.395,
        57.12,
        57.375,
    ],
    to_rgb=True)
dataset_type = 'ImageNet'
default_hooks = dict(
    checkpoint=dict(interval=1, type='CheckpointHook'),
    logger=dict(interval=500, type='TaoTextLoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(enable=False, type='VisualizationHook'))
default_scope = 'mmpretrain'
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = False
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
model = dict(
    backbone=dict(
        drop_path=0.1,
        freeze=False,
        init_cfg=None,
        pretrained=None,
        type='fan_tiny_8_p4_hybrid'),
    head=dict(
        binary=False,
        head_init_scale=1,
        in_channels=192,
        loss=dict(loss_weight=1.0, type='CrossEntropyLoss', use_soft=False),
        num_classes=2,
        type='TAOLinearClsHead'),
    neck=None,
    train_cfg=dict(augments=None),
    type='ImageClassifier')
optim_wrapper = dict(
    optimizer=dict(lr=0.001, type='AdamW', weight_decay=0.05),
    paramwise_cfg=None)
param_scheduler = [
    dict(type='CosineAnnealingLR'),
]
randomness = dict(deterministic=False, seed=49)
resume = False
test_cfg = dict()
test_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes='/workspace/tao-experiments/data/car_color/classes.txt',
        data_prefix='/workspace/tao-experiments/data/car_color/val_set/val_set',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=2,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
test_evaluator = dict(topk=(1, ), type='Accuracy')
train_cfg = dict(by_epoch=True, max_epochs=5, val_interval=1)
train_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        classes='/workspace/tao-experiments/data/car_color/classes.txt',
        data_prefix=
        '/workspace/tao-experiments/data/car_color/training_set/training_set/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='RandomResizedCrop'),
            dict(direction='horizontal', prob=0.5, type='RandomFlip'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=2,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_cfg = dict()
val_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes='/workspace/tao-experiments/data/car_color/classes.txt',
        data_prefix='/workspace/tao-experiments/data/car_color/val_set/val_set',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=2,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_evaluator = dict(topk=(1, ), type='Accuracy')
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='UniversalVisualizer', vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = '/workspace/tao-experiments/result/car_color/train'

09/12 10:07:56 - mmengine - INFO - Because batch augmentations are enabled, the data preprocessor automatically enables the `to_onehot` option to generate one-hot format labels.
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
09/12 10:07:57 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_load_checkpoint:
(ABOVE_NORMAL) EMAHook                            
 -------------------- 
before_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DistSamplerSeedHook                
 -------------------- 
before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) TaoTextLoggerHook                  
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) IterTimerHook                      
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_val:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_val_epoch:
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_val_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) VisualizationHook                  
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) TaoTextLoggerHook                  
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_val:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_save_checkpoint:
(ABOVE_NORMAL) EMAHook                            
 -------------------- 
after_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_test_epoch:
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_test_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_test_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) VisualizationHook                  
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
after_run:
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
09/12 10:07:58 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/12 10:07:58 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/12 10:07:58 - mmengine - INFO - Checkpoints will be saved to /workspace/tao-experiments/result/car_color/train.
Error executing job with overrides: ['results_dir=/workspace/tao-experiments/result/car_color', 'train.train_config.runner.max_epochs=5', 'train.gpu_ids=[0]', 'train.num_gpus=1']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
    run_experiment(cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 74, in run_experiment
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 113, in run_epoch
    self.run_iter(idx, data_batch)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 129, in run_iter
    outputs = self.runner.model.train_step(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/wrappers/distributed.py", line 120, in train_step
    data = self.module.data_preprocessor(data, training=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/models/utils/data_preprocessor.py", line 177, in forward
    batch_score = batch_label_to_onehot(
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/structures/utils.py", line 124, in batch_label_to_onehot
    onehot_list = [
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/structures/utils.py", line 125, in <listcomp>
    sparse_onehot.sum(0)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [7,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
E0912 10:08:02.749000 139973185251136 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:  time      : 2024-09-12_10:08:02
  host      : 6c639e40d573
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 541)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 10:08:02,989 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 10:08:02,989 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 10:08:02,994 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 16} to https://api.tao.ngc.nvidia.com.
[2024-09-12 10:08:04,577 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 10:08:04,578 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 10:08:04,578 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 18:08:05,614 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Full log here

The num_classes is not correct.
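That matches the scatter assertion in the log: idx_dim >= 0 && idx_dim < index_size fails when a one-hot label index falls outside [0, num_classes). A quick sanity check against the paths in the spec (a sketch assuming one class name per line in classes.txt and one sub-folder per class under the train data_prefix):

# Classes listed for the dataset; model.head.num_classes must equal this count
wc -l < /workspace/tao-experiments/data/car_color/classes.txt
# Cross-check against the class folders the dataset loader will scan
ls -d /workspace/tao-experiments/data/car_color/training_set/training_set/*/ | wc -l

The spec above still has num_classes: 2 from the cats/dogs example, so any label index of 2 or higher trips the one-hot scatter.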

After training the .pth model, I applied it to DeepStream. I have no idea which color mode I should use, RGB or BGR, because BGR gives a better result than RGB.

Besides DeepStream, you can run tao model classification_pyt evaluate or tao deploy classification_pyt evaluate. Refer to Image Classification PyT - NVIDIA Docs.
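For example, evaluation of the trained checkpoint looks roughly like this (a sketch; the evaluate.checkpoint override follows the usual TAO 5.x spec layout and the .pth path is illustrative):

tao model classification_pyt evaluate \
    -e /workspace/tao-experiments/specs/train_cats_dogs.yaml \
    evaluate.checkpoint=/path/to/your/model.pth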

The preprocessing uses RGB. You can refer to the settings at the bottom of Deploying to DeepStream for Classification TF1/TF2/PyTorch - NVIDIA Docs. The default mode is 'torch' according to tao_deploy/nvidia_tao_deploy/cv/classification_pyt/scripts/inference.py at 31c7e0ed3fe48942c254b3b85517e7418eea17b3 · NVIDIA/tao_deploy · GitHub.
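In nvinfer terms, 'torch' mode normalizes with the ImageNet mean/std shown in the training log above, and DeepStream applies y = net-scale-factor * (x - offsets), so a single scale factor has to stand in for the per-channel std. A sketch of the relevant keys (the values are derived from that mean/std, not copied from the docs):

[property]
# 0 = RGB, 1 = BGR; the classification_pyt preprocessing expects RGB
model-color-format=0
# per-channel mean in the 0-255 range, from the data_preprocessor in the log
offsets=123.675;116.28;103.53
# ~1/57.12, one shared scale approximating the per-channel std
net-scale-factor=0.017507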

Thank you, this solved the problem!!
