env: EPOCHS=5
Train Classification Model
2024-09-12 18:07:38,258 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-09-12 18:07:38,391 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-09-12 18:07:38,535 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-09-12 18:07:38,535 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-09-12 10:07:45,141 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
Train results will be saved at: /workspace/tao-experiments/result/car_color/train
09/12 10:07:56 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 49
GPU 0: NVIDIA RTX A6000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.0a0+6ddf5cf85e.nv24.04
PyTorch compiling details: PyTorch built with:
- GCC 11.2
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash N/A)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 12.4
- NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_87,code=sm_87;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
- CuDNN 90.1
- Magma 2.6.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.18.0a0
OpenCV: 4.7.0
MMEngine: 0.10.4
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 49
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 1
------------------------------------------------------------
09/12 10:07:56 - mmengine - INFO - Config:
auto_scale_lr = dict(base_batch_size=1024)
custom_hooks = [
dict(momentum=4e-05, priority='ABOVE_NORMAL', type='EMAHook'),
]
data_preprocessor = dict(
mean=[
123.675,
116.28,
103.53,
],
num_classes=2,
std=[
58.395,
57.12,
57.375,
],
to_rgb=True)
dataset_type = 'ImageNet'
default_hooks = dict(
checkpoint=dict(interval=1, type='CheckpointHook'),
logger=dict(interval=500, type='TaoTextLoggerHook'),
param_scheduler=dict(type='ParamSchedulerHook'),
sampler_seed=dict(type='DistSamplerSeedHook'),
timer=dict(type='IterTimerHook'),
visualization=dict(enable=False, type='VisualizationHook'))
default_scope = 'mmpretrain'
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = False
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
model = dict(
backbone=dict(
drop_path=0.1,
freeze=False,
init_cfg=None,
pretrained=None,
type='fan_tiny_8_p4_hybrid'),
head=dict(
binary=False,
head_init_scale=1,
in_channels=192,
loss=dict(loss_weight=1.0, type='CrossEntropyLoss', use_soft=False),
num_classes=2,
type='TAOLinearClsHead'),
neck=None,
train_cfg=dict(augments=None),
type='ImageClassifier')
optim_wrapper = dict(
optimizer=dict(lr=0.001, type='AdamW', weight_decay=0.05),
paramwise_cfg=None)
param_scheduler = [
dict(type='CosineAnnealingLR'),
]
randomness = dict(deterministic=False, seed=49)
resume = False
test_cfg = dict()
test_dataloader = dict(
batch_size=8,
collate_fn=dict(type='default_collate'),
dataset=dict(
ann_file=None,
classes='/workspace/tao-experiments/data/car_color/classes.txt',
data_prefix='/workspace/tao-experiments/data/car_color/val_set/val_set',
pipeline=[
dict(type='LoadImageFromFile'),
dict(scale=224, type='Resize'),
dict(crop_size=224, type='CenterCrop'),
dict(type='PackInputs'),
],
type='ImageNet'),
num_workers=2,
pin_memory=True,
sampler=dict(shuffle=True, type='DefaultSampler'))
test_evaluator = dict(topk=(1, ), type='Accuracy')
train_cfg = dict(by_epoch=True, max_epochs=5, val_interval=1)
train_dataloader = dict(
batch_size=8,
collate_fn=dict(type='default_collate'),
dataset=dict(
classes='/workspace/tao-experiments/data/car_color/classes.txt',
data_prefix=
'/workspace/tao-experiments/data/car_color/training_set/training_set/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(scale=224, type='RandomResizedCrop'),
dict(direction='horizontal', prob=0.5, type='RandomFlip'),
dict(type='PackInputs'),
],
type='ImageNet'),
num_workers=2,
pin_memory=True,
sampler=dict(shuffle=True, type='DefaultSampler'))
val_cfg = dict()
val_dataloader = dict(
batch_size=8,
collate_fn=dict(type='default_collate'),
dataset=dict(
ann_file=None,
classes='/workspace/tao-experiments/data/car_color/classes.txt',
data_prefix='/workspace/tao-experiments/data/car_color/val_set/val_set',
pipeline=[
dict(type='LoadImageFromFile'),
dict(scale=224, type='Resize'),
dict(crop_size=224, type='CenterCrop'),
dict(type='PackInputs'),
],
type='ImageNet'),
num_workers=2,
pin_memory=True,
sampler=dict(shuffle=True, type='DefaultSampler'))
val_evaluator = dict(topk=(1, ), type='Accuracy')
vis_backends = [
dict(type='LocalVisBackend'),
]
visualizer = dict(
type='UniversalVisualizer', vis_backends=[
dict(type='LocalVisBackend'),
])
work_dir = '/workspace/tao-experiments/result/car_color/train'
09/12 10:07:56 - mmengine - INFO - Because batch augmentations are enabled, the data preprocessor automatically enables the `to_onehot` option to generate one-hot format labels.
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
09/12 10:07:57 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(ABOVE_NORMAL) EMAHook
(BELOW_NORMAL) TaoTextLoggerHook
--------------------
after_load_checkpoint:
(ABOVE_NORMAL) EMAHook
--------------------
before_train:
(VERY_HIGH ) RuntimeInfoHook
(ABOVE_NORMAL) EMAHook
(NORMAL ) IterTimerHook
(VERY_LOW ) CheckpointHook
--------------------
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
--------------------
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
--------------------
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(ABOVE_NORMAL) EMAHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) TaoTextLoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
before_val:
(VERY_HIGH ) RuntimeInfoHook
--------------------
before_val_epoch:
(ABOVE_NORMAL) EMAHook
(NORMAL ) IterTimerHook
--------------------
before_val_iter:
(NORMAL ) IterTimerHook
--------------------
after_val_iter:
(NORMAL ) IterTimerHook
(NORMAL ) VisualizationHook
(BELOW_NORMAL) TaoTextLoggerHook
--------------------
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(ABOVE_NORMAL) EMAHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) TaoTextLoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
--------------------
after_val:
(VERY_HIGH ) RuntimeInfoHook
--------------------
before_save_checkpoint:
(ABOVE_NORMAL) EMAHook
--------------------
after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook
--------------------
before_test:
(VERY_HIGH ) RuntimeInfoHook
--------------------
before_test_epoch:
(ABOVE_NORMAL) EMAHook
(NORMAL ) IterTimerHook
--------------------
before_test_iter:
(NORMAL ) IterTimerHook
--------------------
after_test_iter:
(NORMAL ) IterTimerHook
(NORMAL ) VisualizationHook
(BELOW_NORMAL) TaoTextLoggerHook
--------------------
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(ABOVE_NORMAL) EMAHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) TaoTextLoggerHook
--------------------
after_test:
(VERY_HIGH ) RuntimeInfoHook
--------------------
after_run:
(BELOW_NORMAL) TaoTextLoggerHook
--------------------
09/12 10:07:58 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/12 10:07:58 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/12 10:07:58 - mmengine - INFO - Checkpoints will be saved to /workspace/tao-experiments/result/car_color/train.
Error executing job with overrides: ['results_dir=/workspace/tao-experiments/result/car_color', 'train.train_config.runner.max_epochs=5', 'train.gpu_ids=[0]', 'train.num_gpus=1']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
run_experiment(cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 74, in run_experiment
runner.train()
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 96, in run
self.run_epoch()
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 113, in run_epoch
self.run_iter(idx, data_batch)
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 129, in run_iter
outputs = self.runner.model.train_step(
File "/usr/local/lib/python3.10/dist-packages/mmengine/model/wrappers/distributed.py", line 120, in train_step
data = self.module.data_preprocessor(data, training=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/mmpretrain/models/utils/data_preprocessor.py", line 177, in forward
batch_score = batch_label_to_onehot(
File "/usr/local/lib/python3.10/dist-packages/mmpretrain/structures/utils.py", line 124, in batch_label_to_onehot
onehot_list = [
File "/usr/local/lib/python3.10/dist-packages/mmpretrain/structures/utils.py", line 125, in <listcomp>
sparse_onehot.sum(0)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [7,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
E0912 10:08:02.749000 139973185251136 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]: time : 2024-09-12_10:08:02
host : 6c639e40d573
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 541)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 10:08:02,989 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 10:08:02,989 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 10:08:02,994 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 16} to https://api.tao.ngc.nvidia.com.
[2024-09-12 10:08:04,577 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 10:08:04,578 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 10:08:04,578 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 18:08:05,614 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
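
The assertion in the log above (`idx_dim >= 0 && idx_dim < index_size`, raised inside `batch_label_to_onehot`) is what a one-hot conversion reports when a label index is not smaller than the configured number of classes, and the printed config sets `num_classes=2`. One possible cause, not confirmed by this log, is that the dataset contains more class folders than that. The following is a minimal sketch for checking this, assuming one subdirectory per class under the `data_prefix` directory; the helper names `list_class_dirs` and `out_of_range_labels` and the folder names in the demo are hypothetical and are not part of TAO Toolkit or mmpretrain.

```python
import tempfile
from pathlib import Path


def list_class_dirs(data_prefix):
    """Return the sorted names of the immediate subdirectories of data_prefix.

    Assumption: the dataset uses one subdirectory per class.
    """
    return sorted(p.name for p in Path(data_prefix).iterdir() if p.is_dir())


def out_of_range_labels(labels, num_classes):
    """Return the labels that fall outside the valid range [0, num_classes)."""
    return [label for label in labels if not 0 <= label < num_classes]


if __name__ == "__main__":
    # Stand-in dataset with three hypothetical class folders.
    with tempfile.TemporaryDirectory() as root:
        for name in ("black", "red", "white"):
            (Path(root) / name).mkdir()
        classes = list_class_dirs(root)
        # Use the folder index as the label, as a folder-based loader would.
        labels = list(range(len(classes)))
        bad = out_of_range_labels(labels, num_classes=2)
        print(f"{len(classes)} class folders, labels outside [0, 2): {bad}")
```

To apply the check to this run, point `list_class_dirs` at the training `data_prefix` from the config as seen from inside the container. If it reports more folders than `num_classes`, that mismatch would be consistent with the error.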
To resume from a checkpoint, use the command below and update the epoch number accordingly.