Classification_pyt error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
classification_pyt
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)

train:
  exp_config:
    manual_seed: 49
  train_config:
    runner:
      max_epochs: 40
    checkpoint_config:
      interval: 1
    logging:
      interval: 500
    validate: True
    evaluation:
      interval: 1
    custom_hooks:
      - type: "EMAHook"
        momentum: 4e-5
        priority: "ABOVE_NORMAL"
dataset:
  data:
    samples_per_gpu: 8
    train:
      data_prefix: /data/cats_dogs_dataset/training_set/training_set/
      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          size: 224
        - type: RandomFlip
          flip_prob: 0.5
          direction: "horizontal"
      classes: /data/cats_dogs_dataset/classes.txt
    val:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
    test:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
model:
  backbone:
    type: "fan_tiny_8_p4_hybrid"
    custom_args:
      drop_path: 0.1
  head:
    type: "FANLinearClsHead"
    custom_args:
      head_init_scale: 1
    num_classes: 2
    loss:
      type: "CrossEntropyLoss"
      loss_weight: 1.0
      use_soft: False

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

env: EPOCHS=5
Train Classification Model
2024-09-12 11:01:56,227 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-09-12 11:01:56,351 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-09-12 11:01:56,494 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-09-12 11:01:56,494 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-09-12 03:02:02,955 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
train.py: error: unrecognized arguments: -g 1
E0912 03:02:15.628000 139649349068608 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 2) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:  time      : 2024-09-12_03:02:15
  host      : a70d5fe5d884
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 541)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 11} to https://api.tao.ngc.nvidia.com.
[2024-09-12 03:02:17,422 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 03:02:17,423 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 03:02:17,423 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 11:02:18,346 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

This is the error I am facing.

Can you share the command line? It seems there is an unrecognized argument.
You can take a look at the notebook tao_tutorials/notebooks/tao_launcher_starter_kit/dino/dino.ipynb at main · NVIDIA/tao_tutorials · GitHub as a reference.

%env EPOCHS = 5

print("Train Classification Model")
!tao model classification_pyt train \
                  -e /workspace/tao-experiments/specs/train_cats_dogs.yaml \
                  -r $RESULTS_DIR/classification_experiment \
                  -g $NUM_GPUS \
                  train.train_config.runner.max_epochs=$EPOCHS

The -g option is not expected. Please refer to https://docs.nvidia.com/tao/tao-toolkit/text/cv_finetuning/pytorch/image_classification_pyt.html#training-the-model or the notebook, and set train.num_gpus instead.

!tao model classification_pyt train \
                  -e /workspace/tao-experiments/specs/train_cats_dogs.yaml \
                  train.num_gpus=1 \
                  results_dir=/workspace/tao-experiments/result/classification_experiment \
                  train.train_config.runner.max_epochs=5

This is my latest command line; it produces the same error.

Hi, I don't think this issue has been fixed.

May I know which TAO docker you are using?
Please refer to the spec file tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt/specs/train_cats_dogs.yaml at main · NVIDIA/tao_tutorials · GitHub.

Please share the latest error log. Thanks.

Error executing job with overrides: ['results_dir=/workspace/tao-experiments/result/classification_experiment', 'train.train_config.runner.max_epochs=5', 'train.gpu_ids=[0]', 'train.num_gpus=1']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
    run_experiment(cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 74, in run_experiment
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1527, in build_train_loop
    loop = EpochBasedTrainLoop(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 44, in __init__
    super().__init__(runner, dataloader)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/datasets/imagenet.py", line 122, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/datasets/custom.py", line 207, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/datasets/base_dataset.py", line 97, in __init__
    transforms.append(TRANSFORMS.build(transform))
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
TypeError: RandomResizedCrop.__init__() got an unexpected keyword argument 'size'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E0912 06:38:01.120000 140293964285760 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:  time      : 2024-09-12_06:38:01
  host      : 71eb16616be0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 541)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 06:38:01,350 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 06:38:01,350 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 06:38:01,351 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 16} to https://api.tao.ngc.nvidia.com.
[2024-09-12 06:38:02,933 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 06:38:02,934 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 06:38:02,935 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 14:38:03,972 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

The latest error log shows the same failure, and there is another error on top of it. Furthermore, I think the Docker version is shown in the log.

You are using the 5.5 docker.
Your spec file sets an unexpected “size” argument.
Please follow the doc or the notebook spec tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt/specs/train_cats_dogs.yaml at main · NVIDIA/tao_tutorials · GitHub.
It should be “scale” instead of “size”.
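For reference, the corrected augmentation block looks like this (a minimal sketch mirroring the notebook spec; note that in this mmpretrain version RandomFlip likewise takes prob rather than flip_prob):

      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          scale: 224
        - type: RandomFlip
          prob: 0.5
          direction: "horizontal"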

Great news: I could train with the default dataset. However, I am now trying to train with my own dataset, and I followed the same file structure as the default dataset.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertion

I am facing this error when I try to train; the command line is the same.
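As the message suggests, CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the assert is reported at the real call site. With the TAO launcher, one way to pass it into the container is an Envs entry in ~/.tao_mounts.json next to your existing Mounts list (a sketch; it assumes your launcher version supports the Envs key):

"Envs": [
    {
        "variable": "CUDA_LAUNCH_BLOCKING",
        "value": "1"
    }
]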

train:
  exp_config:
    manual_seed: 49
  train_config:
    runner:
      max_epochs: 40
    checkpoint_config:
      interval: 1
    logging:
      interval: 500
    validate: True
    evaluation:
      interval: 1
    custom_hooks:
      - type: "EMAHook"
        momentum: 4e-5
        priority: "ABOVE_NORMAL"
dataset:
  data:
    samples_per_gpu: 8
    train:
      data_prefix: /workspace/tao-experiments/data/car_color/training_set
      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          scale: 224
        - type: RandomFlip
          prob: 0.5
          direction: "horizontal"
      classes:  /workspace/tao-experiments/data/car_color/classes.txt
    val:
      data_prefix:  /workspace/tao-experiments/data/car_color/val_set
      classes:  /workspace/tao-experiments/data/car_color/classes.txt
    # test:
    #   data_prefix:  /workspace/tao-experiments/data/car_color/val_set/val_set
    #   classes:  /workspace/tao-experiments/data/car_color/classes.txt
model:
  backbone:
    type: "fan_tiny_8_p4_hybrid"
    custom_args:
      drop_path: 0.1
  head:
    type: "FANLinearClsHead"
    custom_args:
      head_init_scale: 1
    num_classes: 2
    loss:
      type: "CrossEntropyLoss"
      loss_weight: 1.0
      use_soft: False

I did not modify much in my YAML file.

env: EPOCHS=5
Train Classification Model
2024-09-12 18:07:38,258 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-09-12 18:07:38,391 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-09-12 18:07:38,535 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-09-12 18:07:38,535 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-09-12 10:07:45,141 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
Train results will be saved at: /workspace/tao-experiments/result/car_color/train
09/12 10:07:56 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 49
    GPU 0: NVIDIA RTX A6000
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.131
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.3.0a0+6ddf5cf85e.nv24.04
    PyTorch compiling details: PyTorch built with:
  - GCC 11.2
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash N/A)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_87,code=sm_87;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
  - CuDNN 90.1
  - Magma 2.6.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

    TorchVision: 0.18.0a0
    OpenCV: 4.7.0
    MMEngine: 0.10.4

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 49
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 1
------------------------------------------------------------

09/12 10:07:56 - mmengine - INFO - Config:
auto_scale_lr = dict(base_batch_size=1024)
custom_hooks = [
    dict(momentum=4e-05, priority='ABOVE_NORMAL', type='EMAHook'),
]
data_preprocessor = dict(
    mean=[
        123.675,
        116.28,
        103.53,
    ],
    num_classes=2,
    std=[
        58.395,
        57.12,
        57.375,
    ],
    to_rgb=True)
dataset_type = 'ImageNet'
default_hooks = dict(
    checkpoint=dict(interval=1, type='CheckpointHook'),
    logger=dict(interval=500, type='TaoTextLoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(enable=False, type='VisualizationHook'))
default_scope = 'mmpretrain'
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = False
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
model = dict(
    backbone=dict(
        drop_path=0.1,
        freeze=False,
        init_cfg=None,
        pretrained=None,
        type='fan_tiny_8_p4_hybrid'),
    head=dict(
        binary=False,
        head_init_scale=1,
        in_channels=192,
        loss=dict(loss_weight=1.0, type='CrossEntropyLoss', use_soft=False),
        num_classes=2,
        type='TAOLinearClsHead'),
    neck=None,
    train_cfg=dict(augments=None),
    type='ImageClassifier')
optim_wrapper = dict(
    optimizer=dict(lr=0.001, type='AdamW', weight_decay=0.05),
    paramwise_cfg=None)
param_scheduler = [
    dict(type='CosineAnnealingLR'),
]
randomness = dict(deterministic=False, seed=49)
resume = False
test_cfg = dict()
test_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes='/workspace/tao-experiments/data/car_color/classes.txt',
        data_prefix='/workspace/tao-experiments/data/car_color/val_set/val_set',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=2,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
test_evaluator = dict(topk=(1, ), type='Accuracy')
train_cfg = dict(by_epoch=True, max_epochs=5, val_interval=1)
train_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        classes='/workspace/tao-experiments/data/car_color/classes.txt',
        data_prefix=
        '/workspace/tao-experiments/data/car_color/training_set/training_set/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='RandomResizedCrop'),
            dict(direction='horizontal', prob=0.5, type='RandomFlip'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=2,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_cfg = dict()
val_dataloader = dict(
    batch_size=8,
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        ann_file=None,
        classes='/workspace/tao-experiments/data/car_color/classes.txt',
        data_prefix='/workspace/tao-experiments/data/car_color/val_set/val_set',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(scale=224, type='Resize'),
            dict(crop_size=224, type='CenterCrop'),
            dict(type='PackInputs'),
        ],
        type='ImageNet'),
    num_workers=2,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))
val_evaluator = dict(topk=(1, ), type='Accuracy')
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='UniversalVisualizer', vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = '/workspace/tao-experiments/result/car_color/train'

09/12 10:07:56 - mmengine - INFO - Because batch augmentations are enabled, the data preprocessor automatically enables the `to_onehot` option to generate one-hot format labels.
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
09/12 10:07:57 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_load_checkpoint:
(ABOVE_NORMAL) EMAHook                            
 -------------------- 
before_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DistSamplerSeedHook                
 -------------------- 
before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) TaoTextLoggerHook                  
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) IterTimerHook                      
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_val:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_val_epoch:
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_val_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) VisualizationHook                  
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) TaoTextLoggerHook                  
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_val:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_save_checkpoint:
(ABOVE_NORMAL) EMAHook                            
 -------------------- 
after_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
before_test_epoch:
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_test_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_test_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) VisualizationHook                  
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(ABOVE_NORMAL) EMAHook                            
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
after_test:
(VERY_HIGH   ) RuntimeInfoHook                    
 -------------------- 
after_run:
(BELOW_NORMAL) TaoTextLoggerHook                  
 -------------------- 
09/12 10:07:58 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/12 10:07:58 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/12 10:07:58 - mmengine - INFO - Checkpoints will be saved to /workspace/tao-experiments/result/car_color/train.
Error executing job with overrides: ['results_dir=/workspace/tao-experiments/result/car_color', 'train.train_config.runner.max_epochs=5', 'train.gpu_ids=[0]', 'train.num_gpus=1']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
    run_experiment(cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 74, in run_experiment
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 113, in run_epoch
    self.run_iter(idx, data_batch)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 129, in run_iter
    outputs = self.runner.model.train_step(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/model/wrappers/distributed.py", line 120, in train_step
    data = self.module.data_preprocessor(data, training=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/models/utils/data_preprocessor.py", line 177, in forward
    batch_score = batch_label_to_onehot(
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/structures/utils.py", line 124, in batch_label_to_onehot
    onehot_list = [
  File "/usr/local/lib/python3.10/dist-packages/mmpretrain/structures/utils.py", line 125, in <listcomp>
    sparse_onehot.sum(0)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [7,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
E0912 10:08:02.749000 139973185251136 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:  time      : 2024-09-12_10:08:02
  host      : 6c639e40d573
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 541)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 10:08:02,989 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 10:08:02,989 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 10:08:02,994 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 16} to https://api.tao.ngc.nvidia.com.
[2024-09-12 10:08:04,577 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 10:08:04,578 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 10:08:04,578 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 18:08:05,614 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Full log here

The num_classes is not correct.
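That matches the scatter assertion in the log: idx_dim >= 0 && idx_dim < index_size fails when a one-hot label index falls outside [0, num_classes). A quick sanity check against the paths in the spec (a sketch assuming one class name per line in classes.txt and one sub-folder per class under the train data_prefix):

# Classes listed for the dataset; model.head.num_classes must equal this count
wc -l < /workspace/tao-experiments/data/car_color/classes.txt
# Cross-check against the class folders the dataset loader will scan
ls -d /workspace/tao-experiments/data/car_color/training_set/training_set/*/ | wc -l

The spec above still has num_classes: 2 from the cats/dogs example, so any label index of 2 or higher trips the one-hot scatter.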

After training the .pth model, I applied it to DeepStream. I have no idea which color mode I should use, RGB or BGR, because BGR gives a better result than RGB.

Besides DeepStream, you can run tao model classification_pyt evaluate or tao deploy classification_pyt evaluate. Refer to Image Classification PyT - NVIDIA Docs.
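For example, evaluation of the trained checkpoint looks roughly like this (a sketch; the evaluate.checkpoint override follows the usual TAO 5.x spec layout and the .pth path is illustrative):

tao model classification_pyt evaluate \
    -e /workspace/tao-experiments/specs/train_cats_dogs.yaml \
    evaluate.checkpoint=/path/to/your/model.pth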

The preprocessing uses RGB. You can refer to the settings at the bottom of Deploying to DeepStream for Classification TF1/TF2/PyTorch - NVIDIA Docs. The default mode is 'torch' according to tao_deploy/nvidia_tao_deploy/cv/classification_pyt/scripts/inference.py at 31c7e0ed3fe48942c254b3b85517e7418eea17b3 · NVIDIA/tao_deploy · GitHub.
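In nvinfer terms, 'torch' mode normalizes with the ImageNet mean/std shown in the training log above, and DeepStream applies y = net-scale-factor * (x - offsets), so a single scale factor has to stand in for the per-channel std. A sketch of the relevant keys (the values are derived from that mean/std, not copied from the docs):

[property]
# 0 = RGB, 1 = BGR; the classification_pyt preprocessing expects RGB
model-color-format=0
# per-channel mean in the 0-255 range, from the data_preprocessor in the log
offsets=123.675;116.28;103.53
# ~1/57.12, one shared scale approximating the per-channel std
net-scale-factor=0.017507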

Thank you, this solved the problem!!
